Global Functions (convert key to dummy)

source("thesis_functions.R")

Data

data <- fread("kpopdata.csv")
data <- mutate(data, ArtistType = as.factor(ArtistType),
               ArtistGender = as.factor(ArtistGender),
               ArtistGen = factor(ArtistGen),
               release_date = as.POSIXct(release_date, format = "%m/%d/%y"),
               key = factor(key, levels = 0:11),
               mode = as.factor(mode),
               time_signature = factor(time_signature, levels = c(1,3,4,5)),
               popular = factor(ifelse(popularity >=50, 1, 0)))

Understanding Popularity

General Assumptions: the response, popularity, is a continuous score ranging from 0 to 100; the predictors are a mix of categorical and continuous variables. The distributions of the variables are not normal, so we will check the normality of the error terms where appropriate.

Goal: create a model for predicting popularity scores.

Assessing normality of popularity scores

ggplot(data, aes(x=popularity)) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(data$popularity), sd = sd(data$popularity)), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Popularity") + ylab("Density") +
  ggtitle("Popularity Distribution") +
  theme_bw()

summary(data$popularity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   23.00   25.88   39.00   94.00
#produce normal qqplot to assess normality
qqnorm(data$popularity, pch = 1, frame = FALSE)
qqline(data$popularity, col = "red", lwd = 2)

#Skewness score
skewness(data$popularity)
## [1] 0.5317036
kurtosis(data$popularity)
## [1] 2.530694

The popularity scores are moderately right-skewed.
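As a rough rule of thumb, |skewness| below 0.5 is often treated as approximately symmetric, 0.5 to 1 as moderately skewed, and above 1 as highly skewed. A minimal sketch of that classification (the helper is hypothetical and the thresholds are a common convention, not a fixed standard):

```r
# Hypothetical helper: label a skewness value by a common rule of thumb
skew_label <- function(s) {
  a <- abs(s)
  if (a < 0.5) "approximately symmetric"
  else if (a <= 1) "moderately skewed"
  else "highly skewed"
}

skew_label(0.5317036)  # "moderately skewed"
```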

Square Root Transformation

We try a square root transformation to make the distribution more normal. This will help to meet the multiple linear regression assumptions.

ggplot(data, aes(x=popularity^0.5)) +
  geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(data$popularity^0.5), sd = sd(data$popularity^0.5)), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Square Root of Popularity") + ylab("Density") +
  ggtitle("Square Root Popularity Distribution") +
  theme_bw()

summary(data$popularity^0.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.317   4.796   4.679   6.245   9.695
#produce normal qqplot to assess normality
qqnorm(data$popularity^0.5, pch = 1, frame = FALSE)
qqline(data$popularity^0.5, col = "red", lwd = 2)

#Skewness score
skewness(data$popularity^0.5)
## [1] -0.3381829
kurtosis(data$popularity^0.5)
## [1] 2.532572

Multiple Linear Model

select just audio features

kpop <- select(data, popularity, duration, acousticness, danceability, energy, instrumentalness, key, loudness, mode, speechiness, tempo, valence)
kpop0 <- kpop %>% filter(mode == 0)%>% select(-mode)
kpop1 <- kpop %>% filter(mode == 1) %>% select(-mode)

### Kpop mode 0 train and test
smpl.size0 <- floor(0.75*nrow(kpop0))
set.seed(123)
smpl0 <- sample(nrow(kpop0), smpl.size0, replace = FALSE)
train0 <- kpop0[smpl0,]
test0 <- kpop0[-smpl0,]

### Kpop mode 1 train and test
smpl.size1 <- floor(0.75*nrow(kpop1))
set.seed(123)
smpl1 <- sample(nrow(kpop1), smpl.size1, replace = FALSE)
train1 <- kpop1[smpl1,]
test1 <- kpop1[-smpl1,]

For Songs with Mode 0

Multiple Linear Regression Assumption Checking

fit a multiple linear regression model

ml0 <- lm(popularity ~. , data = train0)
summary(ml0)
## 
## Call:
## lm(formula = popularity ~ ., data = train0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.283 -13.161  -1.606  11.489  64.381 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       98.430219   4.992479  19.716  < 2e-16 ***
## duration          -0.090184   0.009317  -9.679  < 2e-16 ***
## acousticness      -3.819305   2.053031  -1.860 0.062924 .  
## danceability      -2.192785   3.285433  -0.667 0.504544    
## energy           -33.971079   3.335048 -10.186  < 2e-16 ***
## instrumentalness -15.674945   4.376253  -3.582 0.000346 ***
## key1              -1.205989   1.499467  -0.804 0.421291    
## key2              -2.729912   1.855982  -1.471 0.141416    
## key3               3.040837   2.075808   1.465 0.143040    
## key4              -4.346089   1.469828  -2.957 0.003129 ** 
## key5              -0.564903   1.426951  -0.396 0.692217    
## key6               1.346808   1.489844   0.904 0.366062    
## key7               0.041229   1.632769   0.025 0.979856    
## key8              -1.573809   1.763610  -0.892 0.372251    
## key9              -0.945472   1.512239  -0.625 0.531872    
## key10             -1.757197   1.556351  -1.129 0.258955    
## key11             -2.418422   1.389503  -1.740 0.081861 .  
## loudness           3.556233   0.202714  17.543  < 2e-16 ***
## speechiness       23.779411   4.747553   5.009 5.75e-07 ***
## tempo             -0.004290   0.012545  -0.342 0.732398    
## valence          -11.884888   1.788630  -6.645 3.51e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.34 on 3483 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.138 
## F-statistic: 29.05 on 20 and 3483 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(ml0)

Interpreting the diagnostic plots:

  • Residuals vs. Fitted: The red line is roughly horizontal but shows a polynomial-like curve around the zero line. This suggests the relationship is approximately linear, with some remaining curvature.

  • Normal Q-Q: The residuals roughly follow the diagonal Normal Q-Q line, indicating that the residuals are approximately normally distributed.

  • Scale-Location: The red line does not run horizontally across the plot; it trends slightly upward. The residuals therefore indicate a heteroscedasticity problem. In other words, we do not satisfy the MLR assumption of equal variance across residuals.

  • Residuals vs. Leverage: Several points have standardized residuals above 3, so the data contain many outliers.
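The visual heteroscedasticity check above can be supplemented with a formal test; a sketch using the Breusch-Pagan test from the lmtest package (assuming it is installed), where a small p-value indicates non-constant error variance:

```r
library(lmtest)

# Breusch-Pagan test on the fitted model: H0 = homoscedastic errors
# A small p-value suggests the constant-variance assumption is violated
bptest(ml0)
```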

Try again with the square root transformation applied to make popularity closer to normal:

ml0.sqrt <- lm(popularity ~. , data = train0 %>% mutate(popularity = popularity^0.5))
summary(ml0.sqrt)
## 
## Call:
## lm(formula = popularity ~ ., data = train0 %>% mutate(popularity = popularity^0.5))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9651 -1.2163  0.1704  1.3715  5.2110 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.8026945  0.5397564  23.719  < 2e-16 ***
## duration         -0.0097617  0.0010073  -9.691  < 2e-16 ***
## acousticness     -0.1418036  0.2219612  -0.639 0.522952    
## danceability     -0.1381154  0.3552010  -0.389 0.697420    
## energy           -3.9085950  0.3605651 -10.840  < 2e-16 ***
## instrumentalness -2.2443174  0.4731338  -4.744 2.18e-06 ***
## key1             -0.1306867  0.1621132  -0.806 0.420214    
## key2             -0.3754834  0.2006575  -1.871 0.061392 .  
## key3              0.1823478  0.2244238   0.813 0.416551    
## key4             -0.5438271  0.1589089  -3.422 0.000628 ***
## key5             -0.1190871  0.1542733  -0.772 0.440212    
## key6              0.1383520  0.1610728   0.859 0.390433    
## key7             -0.0235720  0.1765251  -0.134 0.893779    
## key8             -0.1566102  0.1906708  -0.821 0.411495    
## key9             -0.1520276  0.1634941  -0.930 0.352505    
## key10            -0.2351475  0.1682632  -1.397 0.162353    
## key11            -0.2909129  0.1502246  -1.937 0.052885 .  
## loudness          0.4225379  0.0219162  19.280  < 2e-16 ***
## speechiness       2.1574574  0.5132766   4.203 2.70e-05 ***
## tempo            -0.0007701  0.0013563  -0.568 0.570207    
## valence          -1.1893401  0.1933758  -6.150 8.60e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.874 on 3483 degrees of freedom
## Multiple R-squared:  0.1594, Adjusted R-squared:  0.1546 
## F-statistic: 33.03 on 20 and 3483 DF,  p-value: < 2.2e-16

assumption checking and diagnostics

par(mfrow = c(2,2))
plot(ml0.sqrt)

Interpreting the diagnostic plots:

  • Residuals vs. Fitted: The red line is roughly horizontal but still exhibits some curvature. Overall, though, we can conclude that the relationship between the predictors and the outcome is roughly linear.

  • Normal Q-Q: The residuals roughly follow the diagonal Normal Q-Q line, indicating that the residuals are approximately normally distributed. Compared to the original scale of the data, there is slightly more deviation in the upper tail, with points falling below the line.

  • Scale-Location: The red line is now roughly horizontal, so the residuals no longer show noticeable heteroscedasticity. In other words, the model now satisfies the MLR assumption of equal variance across residuals.

  • Residuals vs. Leverage: Most standardized residuals fall within the values of -3 and 3, so there are few outliers in the dataset.
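The outlier claim suggested by the Residuals vs. Leverage panel can be made explicit by counting standardized residuals outside ±3; a minimal sketch:

```r
# Count training observations whose standardized residuals fall outside [-3, 3]
std.res <- rstandard(ml0.sqrt)
sum(abs(std.res) > 3)    # number of flagged points
mean(abs(std.res) > 3)   # proportion of the training set they represent
```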

Full MLR Model

summary(ml0.sqrt)
## 
## Call:
## lm(formula = popularity ~ ., data = train0 %>% mutate(popularity = popularity^0.5))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9651 -1.2163  0.1704  1.3715  5.2110 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.8026945  0.5397564  23.719  < 2e-16 ***
## duration         -0.0097617  0.0010073  -9.691  < 2e-16 ***
## acousticness     -0.1418036  0.2219612  -0.639 0.522952    
## danceability     -0.1381154  0.3552010  -0.389 0.697420    
## energy           -3.9085950  0.3605651 -10.840  < 2e-16 ***
## instrumentalness -2.2443174  0.4731338  -4.744 2.18e-06 ***
## key1             -0.1306867  0.1621132  -0.806 0.420214    
## key2             -0.3754834  0.2006575  -1.871 0.061392 .  
## key3              0.1823478  0.2244238   0.813 0.416551    
## key4             -0.5438271  0.1589089  -3.422 0.000628 ***
## key5             -0.1190871  0.1542733  -0.772 0.440212    
## key6              0.1383520  0.1610728   0.859 0.390433    
## key7             -0.0235720  0.1765251  -0.134 0.893779    
## key8             -0.1566102  0.1906708  -0.821 0.411495    
## key9             -0.1520276  0.1634941  -0.930 0.352505    
## key10            -0.2351475  0.1682632  -1.397 0.162353    
## key11            -0.2909129  0.1502246  -1.937 0.052885 .  
## loudness          0.4225379  0.0219162  19.280  < 2e-16 ***
## speechiness       2.1574574  0.5132766   4.203 2.70e-05 ***
## tempo            -0.0007701  0.0013563  -0.568 0.570207    
## valence          -1.1893401  0.1933758  -6.150 8.60e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.874 on 3483 degrees of freedom
## Multiple R-squared:  0.1594, Adjusted R-squared:  0.1546 
## F-statistic: 33.03 on 20 and 3483 DF,  p-value: < 2.2e-16

The full Multiple Linear Regression model can be written \[ \hat {\sqrt {popularity}} = 12.803 - 0.010{Duration} - 0.142{Acousticness} - 0.138{Danceability} - 3.909{Energy} - 2.244{Instrumentalness} \\ - 0.131{Key1} - 0.375{Key2} + 0.182{Key3} - 0.544{Key4} - 0.119{Key5} + 0.138{Key6} - 0.024{Key7} \\ - 0.157{Key8} - 0.152{Key9} - 0.235{Key10} - 0.291{Key11} + 0.423{Loudness} + 2.157{Speechiness} \\ -0.001{Tempo} - 1.189{Valence} \]

The variables that contribute significantly to predicting the popularity score are Duration, Energy, Instrumentalness, Key4 (E Minor), Loudness, Speechiness, and Valence.

An increase in Duration, Acousticness, Danceability, Energy, Instrumentalness, Tempo, or Valence, or choosing Key1 (C#/D-flat Minor), Key2 (D Minor), Key4 (E Minor), Key5 (F Minor), Key7 (G Minor), Key8 (G#/A-flat Minor), Key9 (A Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root popularity score, on average. An increase in Loudness or Speechiness, or choosing Key3 (D#/E-flat Minor) or Key6 (F#/G-flat Minor) over Key0 (C Minor), increases the square root popularity score, on average.
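Because the model is fit on the square root scale, a fitted value must be squared to recover a popularity score on the original 0-100 scale. A sketch (the song's feature values are hypothetical, chosen only for illustration):

```r
# Hypothetical mode-0 song; key must use the same factor levels as the training data
new.song <- data.frame(duration = 3.5, acousticness = 0.2, danceability = 0.7,
                       energy = 0.8, instrumentalness = 0,
                       key = factor(4, levels = 0:11),
                       loudness = -4, speechiness = 0.1, tempo = 120, valence = 0.5)

sqrt.pred <- predict(ml0.sqrt, newdata = new.song)  # prediction on the sqrt scale
sqrt.pred^2  # predicted popularity on the original 0-100 scale
```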

prediction on test data

# prediction on test data
yhat.mlr = predict(ml0.sqrt, newdata = test0 %>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.mlr <- yhat.mlr - test0$popularity^0.5
rmse.mlr <- sqrt(mean(error.mlr^2))
data.frame(
  RMSE = RMSE(yhat.mlr, test0$popularity^0.5),
  R2 = R2(yhat.mlr, test0$popularity^0.5)
)
##       RMSE        R2
## 1 1.863478 0.1332768

Variable Selection: Stepwise 10 fold cross validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
bye <- capture.output(mlr.step_kcv <- train(popularity ~. , data = train0 %>% mutate(popularity = popularity^0.5),  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv)
## Linear Regression with Stepwise Selection 
## 
## 3504 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3153, 3152, 3154, 3154, 3154, 3154, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.878512  0.1528544  1.516643
summary(mlr.step_kcv$finalModel)
## 
## Call:
## lm(formula = .outcome ~ duration + energy + instrumentalness + 
##     key2 + key3 + key4 + key6 + key11 + loudness + speechiness + 
##     valence, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9636 -1.2044  0.1791  1.3736  5.2248 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.4039589  0.3759972  32.989  < 2e-16 ***
## duration         -0.0097167  0.0009949  -9.766  < 2e-16 ***
## energy           -3.8007377  0.3157733 -12.036  < 2e-16 ***
## instrumentalness -2.2440721  0.4710858  -4.764 1.98e-06 ***
## key2             -0.2588154  0.1619102  -1.599   0.1100    
## key3              0.3053371  0.1903552   1.604   0.1088    
## key4             -0.4241428  0.1056953  -4.013 6.12e-05 ***
## key6              0.2617396  0.1088858   2.404   0.0163 *  
## key11            -0.1689374  0.0921533  -1.833   0.0669 .  
## loudness          0.4214141  0.0217796  19.349  < 2e-16 ***
## speechiness       2.0999791  0.5079769   4.134 3.65e-05 ***
## valence          -1.2328499  0.1723401  -7.154 1.03e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.873 on 3492 degrees of freedom
## Multiple R-squared:  0.1586, Adjusted R-squared:  0.156 
## F-statistic: 59.84 on 11 and 3492 DF,  p-value: < 2.2e-16

The Stepwise 10 fold Cross Validation Multiple Linear Regression model can be written: \[ \hat {\sqrt {popularity}} = 12.404 - 0.010{Duration} - 3.801{Energy} - 2.244{Instrumentalness} - 0.259{Key2} + 0.305{Key3} - 0.424{Key4} \\ + 0.262{Key6} - 0.169{Key11} + 0.421{Loudness} + 2.100{Speechiness} - 1.233{Valence} \]

Compared to the full Multiple Linear Regression Model, the Stepwise method removes the variables: Acousticness, Danceability, Key1, Key5, Key7 - Key10, and Tempo.

The variables that contribute significantly to predicting the popularity score are Duration, Energy, Instrumentalness, Key4 (E Minor), Key6 (F#/G-flat Minor), Loudness, Speechiness, and Valence.

An increase in Duration, Energy, Instrumentalness, or Valence, or choosing Key2 (D Minor), Key4 (E Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root popularity score, on average. An increase in Loudness or Speechiness, or choosing Key3 (D#/E-flat Minor) or Key6 (F#/G-flat Minor) over Key0 (C Minor), increases the square root popularity score, on average.

prediction on test data

# prediction on test data
yhat.step_kcv = predict(mlr.step_kcv$finalModel, newdata=key.dummy(test0) %>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.step_kcv <- yhat.step_kcv - test0$popularity^0.5
rmse.step_kcv <- sqrt(mean(error.step_kcv^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv, test0$popularity^0.5),
  R2 = R2(yhat.step_kcv, test0$popularity^0.5)
)
##       RMSE        R2
## 1 1.861179 0.1350903

Both models have RMSE scores of around 1.86 on the square root scale. This is reasonable given that the transformed popularity scores range from 0 to about 10.
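The square-root-scale RMSE is not directly comparable to the 0-100 popularity scale. Back-transforming the predictions before computing the error gives an RMSE in original popularity points; a sketch (note that simply squaring the predictions introduces some back-transformation bias):

```r
# Back-transform sqrt-scale predictions, then compute RMSE in popularity points
yhat.orig <- yhat.mlr^2
rmse.orig <- sqrt(mean((yhat.orig - test0$popularity)^2))
rmse.orig
```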

Regularized Regression: Ridge

lm.ridge0 <- train(popularity ~. , data = train0 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge0$bestTune
##    alpha lambda
## 47     0  0.047
# best coefficient
lm.ridge0.model <- coef(lm.ridge0$finalModel, lm.ridge0$bestTune$lambda)
lm.ridge0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)      12.3411323068
## duration         -0.0096041085
## acousticness     -0.0662308488
## danceability     -0.1493986374
## energy           -3.5601398635
## instrumentalness -2.2691735701
## key1             -0.0831090967
## key2             -0.3241117452
## key3              0.2260133231
## key4             -0.4931505942
## key5             -0.0698858769
## key6              0.1837187046
## key7              0.0173666066
## key8             -0.1059775161
## key9             -0.1084064701
## key10            -0.1844827951
## key11            -0.2447391761
## loudness          0.3990852347
## speechiness       2.0404769494
## tempo            -0.0007694543
## valence          -1.1748730700

The Ridge Multiple Linear Regularized Regression model can be written \[ \hat {\sqrt {popularity}} = 12.341 - 0.010{Duration} - 0.066{Acousticness} - 0.149{Danceability} - 3.560{Energy} - 2.269{Instrumentalness} \\ - 0.083{Key1} - 0.324{Key2} + 0.226{Key3} - 0.493{Key4} - 0.070{Key5} + 0.184{Key6} + 0.017{Key7} \\ - 0.106{Key8} - 0.108{Key9} - 0.184{Key10} - 0.245{Key11} + 0.399{Loudness} + 2.040{Speechiness} \\ -0.001{Tempo} - 1.175{Valence} \]

An increase in Duration, Acousticness, Danceability, Energy, Instrumentalness, Tempo, or Valence, or choosing Key1 (C#/D-flat Minor), Key2 (D Minor), Key4 (E Minor), Key5 (F Minor), Key8 (G#/A-flat Minor), Key9 (A Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root of the popularity score, on average. An increase in Loudness or Speechiness, or choosing Key3 (D#/E-flat Minor), Key6 (F#/G-flat Minor), or Key7 (G Minor) over Key0 (C Minor), increases the square root of the popularity score, on average.

In comparison to the full MLR model, ridge shrinks the coefficients toward zero, and the coefficient of the Key7 dummy variable flips sign from negative to positive.
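The shrinkage can be inspected directly by placing the OLS and ridge coefficients side by side; a minimal sketch using the objects fit above (the coefficient orderings of the two models match here, intercept first):

```r
# Compare full-model (OLS) and ridge coefficients term by term
coef.compare <- data.frame(OLS   = coef(ml0.sqrt),
                           Ridge = as.vector(lm.ridge0.model))
# Positive values indicate the ridge estimate is closer to zero
coef.compare$Shrinkage <- abs(coef.compare$OLS) - abs(coef.compare$Ridge)
coef.compare
```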

# prediction on test data
yhat.ridge0 = predict(lm.ridge0, s=lm.ridge0$bestTune,newdata=test0%>% mutate(popularity = popularity^0.5))
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge0, test0$popularity^0.5),
  R2 = R2(yhat.ridge0, test0$popularity^0.5)
)
##       RMSE        R2
## 1 1.861093 0.1339601

Regularized Regression: Lasso

lm.lasso0 <- train(popularity ~. , data = train0 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso0$bestTune
##    alpha lambda
## 17     1  0.017
# best coefficient
lm.lasso0.model <- coef(lm.lasso0$finalModel, lm.lasso0$bestTune$lambda)
lm.lasso0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      11.926081916
## duration         -0.009219647
## acousticness      .          
## danceability      .          
## energy           -3.509181259
## instrumentalness -2.058245554
## key1              .          
## key2             -0.161479016
## key3              0.216989939
## key4             -0.363371695
## key5              .          
## key6              0.223544343
## key7              0.035935912
## key8              .          
## key9              .          
## key10            -0.044086831
## key11            -0.120863074
## loudness          0.399346845
## speechiness       1.775570123
## tempo             .          
## valence          -1.140052989

The Lasso Multiple Linear Regularized Regression model can be written \[ \hat {\sqrt {popularity}} = 11.926 - 0.009{Duration} - 3.509{Energy} - 2.058{Instrumentalness} \\ - 0.161{Key2} + 0.217{Key3} - 0.363{Key4} + 0.224{Key6} + 0.036{Key7} - 0.044{Key10} \\ - 0.121{Key11} + 0.399{Loudness} + 1.776{Speechiness} - 1.140{Valence} \]

Compared to the full Multiple Linear Regression model, the Lasso model removes the variables Acousticness, Danceability, Key1, Key5, Key8, Key9, and Tempo.
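The retained and dropped variables can also be read programmatically from the sparse coefficient matrix, where dropped terms are stored as zeros; a sketch:

```r
# Names of predictors the lasso kept (nonzero) and dropped (zero), excluding the intercept
coefs <- as.matrix(lm.lasso0.model)[-1, 1]
names(coefs)[coefs != 0]  # retained predictors
names(coefs)[coefs == 0]  # dropped predictors
```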

An increase in Duration, Energy, Instrumentalness, or Valence, or choosing Key2 (D Minor), Key4 (E Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root popularity score, on average. An increase in Loudness or Speechiness, or choosing Key3 (D#/E-flat Minor), Key6 (F#/G-flat Minor), or Key7 (G Minor) over Key0 (C Minor), increases the square root popularity score, on average.

# prediction on test data
yhat.lasso0 = predict(lm.lasso0, s=lm.lasso0$bestTune,newdata=test0%>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.lasso0 <- yhat.lasso0 - test0$popularity^0.5
rmse.lasso0 <- sqrt(mean(error.lasso0^2))
data.frame(
  RMSE = RMSE(yhat.lasso0, test0$popularity^0.5),
  R2 = R2(yhat.lasso0, test0$popularity^0.5)
)
##       RMSE        R2
## 1 1.858953 0.1351553

Regularized Regression: Elastic Net

lm.enet0 <- train(popularity ~. , data = train0 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lm.enet0$bestTune
##      alpha lambda
## 2018     1  0.018
# best coefficient
lm.enet0.model <- coef(lm.enet0$finalModel, lm.enet0$bestTune$lambda)
lm.enet0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      11.897380956
## duration         -0.009188124
## acousticness      .          
## danceability      .          
## energy           -3.491742638
## instrumentalness -2.047453789
## key1              .          
## key2             -0.155458931
## key3              0.212147017
## key4             -0.359461075
## key5              .          
## key6              0.221662630
## key7              0.032645442
## key8              .          
## key9              .          
## key10            -0.039857479
## key11            -0.117656255
## loudness          0.398035935
## speechiness       1.754828098
## tempo             .          
## valence          -1.135217969

The Elastic Net Multiple Linear Regression model can be written \[ \hat {\sqrt {popularity}} = 11.897 - 0.009{Duration} - 3.492{Energy} - 2.047{Instrumentalness} - 0.155{Key2} \\ + 0.212{Key3} - 0.359{Key4} + 0.222{Key6} + 0.033{Key7} - 0.040{Key10} - 0.118{Key11} \\ + 0.398{Loudness} + 1.755{Speechiness} - 1.135{Valence} \]

Compared to the full Multiple Linear Regression model, the Elastic Net regression removes the variables Acousticness, Danceability, Key1, Key5, Key8, Key9, and Tempo. The Key7 dummy variable also has a positive coefficient rather than the negative coefficient of the full model.

An increase in Duration, Energy, Instrumentalness, or Valence, or choosing Key2 (D Minor), Key4 (E Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root popularity score, on average. An increase in Loudness or Speechiness, or choosing Key3 (D#/E-flat Minor), Key6 (F#/G-flat Minor), or Key7 (G Minor) instead of Key0 (C Minor), increases the square root popularity score, on average.

# prediction on test data
yhat.enet0 = predict(lm.enet0, s=lm.enet0$bestTune,newdata=test0%>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.enet0 <- yhat.enet0 - test0$popularity^0.5
rmse.enet0 <- sqrt(mean(error.enet0^2))
data.frame(
  RMSE = RMSE(yhat.enet0, test0$popularity^0.5),
  R2 = R2(yhat.enet0, test0$popularity^0.5)
)
##       RMSE        R2
## 1 1.858834 0.1351914

Deciding the best MLR model for mode 0 songs

mlr0.results <- data.frame(Model = c("Full", "Step 10CV", "Ridge 10CV", "Lasso 10CV", "ElasticNet 10CV"),
                           #AIC = c(),
                           adjR2 = c(0.1546, 0.1560, NA, NA,NA),
                           RMSE = c(1.8635, 1.8612,1.8611, 1.8588, 1.8592),
                           Pred.R2 = c( 0.1333, 0.1351, 0.1340, 0.1352, 0.1351))
mlr0.results
##             Model  adjR2   RMSE Pred.R2
## 1            Full 0.1546 1.8635  0.1333
## 2       Step 10CV 0.1560 1.8612  0.1351
## 3      Ridge 10CV     NA 1.8611  0.1340
## 4      Lasso 10CV     NA 1.8588  0.1352
## 5 ElasticNet 10CV     NA 1.8592  0.1351

The model that yields the lowest test RMSE and the highest prediction R2 is the Lasso Multiple Linear Regularized Regression model.

The Lasso Multiple Linear Regularized Regression model can be written \[ \hat {\sqrt {popularity}} = 11.926 - 0.009{Duration} - 3.509{Energy} - 2.058{Instrumentalness} \\ - 0.161{Key2} + 0.217{Key3} - 0.363{Key4} + 0.224{Key6} + 0.036{Key7} - 0.044{Key10} \\ - 0.121{Key11} + 0.399{Loudness} + 1.776{Speechiness} - 1.140{Valence} \]

For songs with Mode 1

Assumption checking

ml1 <- lm(popularity ~. , data = train1)
par(mfrow = c(2,2))
plot(ml1)

ml1.sqrt <- lm(popularity ~. , data = train1 %>% mutate(popularity = popularity^0.5))
par(mfrow = c(2,2))
plot(ml1.sqrt)

Full Multiple linear model

summary(ml1.sqrt)
## 
## Call:
## lm(formula = popularity ~ ., data = train1 %>% mutate(popularity = popularity^0.5))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4223 -1.2067  0.1313  1.3255  5.2394 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.8571767  0.4012433  27.059  < 2e-16 ***
## duration         -0.0092489  0.0007447 -12.419  < 2e-16 ***
## acousticness      0.0805595  0.1482707   0.543 0.586927    
## danceability      0.0508851  0.2545136   0.200 0.841542    
## energy           -3.0777912  0.2809079 -10.957  < 2e-16 ***
## instrumentalness -1.8772989  0.3529644  -5.319 1.09e-07 ***
## key1              0.3212935  0.0916230   3.507 0.000457 ***
## key2              0.2071664  0.0980728   2.112 0.034699 *  
## key3              0.4306126  0.1503977   2.863 0.004210 ** 
## key4             -0.0291157  0.1272720  -0.229 0.819058    
## key5              0.2042781  0.1117088   1.829 0.067504 .  
## key6              0.2169456  0.1092982   1.985 0.047206 *  
## key7              0.2832696  0.0911450   3.108 0.001894 ** 
## key8              0.2970985  0.1057360   2.810 0.004975 ** 
## key9              0.2708845  0.1134637   2.387 0.017001 *  
## key10             0.0800665  0.1321708   0.606 0.544685    
## key11             0.3899069  0.1176778   3.313 0.000928 ***
## loudness          0.3862563  0.0173867  22.216  < 2e-16 ***
## speechiness       4.5798143  0.4149651  11.037  < 2e-16 ***
## tempo            -0.0014736  0.0009580  -1.538 0.124055    
## valence          -0.8492855  0.1545529  -5.495 4.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.821 on 5483 degrees of freedom
## Multiple R-squared:  0.1499, Adjusted R-squared:  0.1468 
## F-statistic: 48.35 on 20 and 5483 DF,  p-value: < 2.2e-16

The full Multiple Linear Regression model can be written \[ \hat {\sqrt {popularity}} = 10.857 - 0.010{Duration} + 0.081{Acousticness} + 0.051{Danceability} - 3.078{Energy} - 1.877{Instrumentalness} \\ + 0.321{Key1} + 0.207{Key2} + 0.431{Key3} - 0.029{Key4} + 0.204{Key5} + 0.217{Key6} + 0.283{Key7} \\ + 0.297{Key8} + 0.271{Key9} + 0.080{Key10} + 0.390{Key11} + 0.386{Loudness} + 4.580{Speechiness} \\ -0.001{Tempo} - 0.849{Valence} \]

The variables that contribute significantly to predicting the popularity score are Duration, Energy, Instrumentalness, Key1 (C#/D-flat Major), Key2, Key3, Key6, Key7, Key8, Key9, Key11, Loudness, Speechiness, and Valence.

An increase in Duration, Energy, Instrumentalness, Tempo, or Valence, or choosing Key4 (E Major) instead of Key0 (C Major), decreases the square root popularity score, on average. An increase in Acousticness, Danceability, Loudness, or Speechiness, or choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), Key10 (A#/B-flat Major), or Key11 (B Major) over Key0 (C Major), increases the square root popularity score, on average.

prediction on test data

# prediction on test data
yhat.mlr = predict(ml1.sqrt, newdata = test1 %>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.mlr <- yhat.mlr - test1$popularity^0.5
rmse.mlr <- sqrt(mean(error.mlr^2))
data.frame(
  RMSE = RMSE(yhat.mlr, test1$popularity^0.5),
  R2 = R2(yhat.mlr, test1$popularity^0.5)
)
##       RMSE        R2
## 1 1.858473 0.1257546

Variable Selection: Stepwise 10 fold cross validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
bye <- capture.output(mlr.step_kcv <- train(popularity ~. , data = train1 %>% mutate(popularity = popularity^0.5),  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv)
## Linear Regression with Stepwise Selection 
## 
## 5504 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4953, 4954, 4954, 4955, 4953, 4953, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.823965  0.1452973  1.468792
summary(mlr.step_kcv$finalModel)
## 
## Call:
## lm(formula = .outcome ~ duration + energy + instrumentalness + 
##     key1 + key2 + key3 + key5 + key6 + key7 + key8 + key9 + key11 + 
##     loudness + speechiness + tempo + valence, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4396 -1.2039  0.1341  1.3270  5.2434 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.9848032  0.3075276  35.720  < 2e-16 ***
## duration         -0.0092575  0.0007364 -12.571  < 2e-16 ***
## energy           -3.1605194  0.2354845 -13.421  < 2e-16 ***
## instrumentalness -1.8844623  0.3524459  -5.347 9.31e-08 ***
## key1              0.3118350  0.0840432   3.710 0.000209 ***
## key2              0.1989538  0.0909328   2.188 0.028717 *  
## key3              0.4250833  0.1449895   2.932 0.003384 ** 
## key5              0.1963190  0.1054161   1.862 0.062611 .  
## key6              0.2098630  0.1030213   2.037 0.041690 *  
## key7              0.2745519  0.0835910   3.284 0.001028 ** 
## key8              0.2885193  0.0992487   2.907 0.003663 ** 
## key9              0.2616042  0.1071561   2.441 0.014664 *  
## key11             0.3831543  0.1116079   3.433 0.000601 ***
## loudness          0.3871730  0.0172524  22.442  < 2e-16 ***
## speechiness       4.5787531  0.4146235  11.043  < 2e-16 ***
## tempo            -0.0015353  0.0009202  -1.668 0.095302 .  
## valence          -0.8365555  0.1361062  -6.146 8.49e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.82 on 5487 degrees of freedom
## Multiple R-squared:  0.1498, Adjusted R-squared:  0.1473 
## F-statistic: 60.42 on 16 and 5487 DF,  p-value: < 2.2e-16

The Stepwise Multiple Linear Regression model can be written: \[ \hat {\sqrt {popularity}} = 10.985 - 0.010{Duration} - 3.161{Energy} - 1.884{Instrumentalness} \\ + 0.312{Key1} + 0.199{Key2} + 0.425{Key3} + 0.196{Key5} + 0.210{Key6} + 0.275{Key7} \\ + 0.289{Key8} + 0.262{Key9} + 0.383{Key11} + 0.387{Loudness} + 4.579{Speechiness} \\ -0.002{Tempo} - 0.837{Valence} \]

Compared to the full multiple linear regression model, the stepwise model removes the variables Acousticness, Danceability, Key4, and Key10.

The variables with a significant contribution to predicting the popularity score are Duration, Energy, Instrumentalness, Key1 (C#/D-flat Major), Key2, Key3, Key6, Key7, Key8, Key9, Key11, Loudness, Speechiness, and Valence.

An increase in Duration, Energy, Instrumentalness, Tempo, or Valence decreases the square root popularity score, on average. An increase in Loudness or Speechiness, or choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), or Key11 (B Major) over Key0 (C Major), increases the square root popularity score, on average.
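As a worked illustration of the fitted equation, the square-root popularity for a hypothetical mode 1 song can be computed by hand and squared back to the 0-100 scale. All feature values below are invented for illustration, and the duration units are assumed to match the training data:

```r
# Hypothetical song in Key7 (G Major); every value here is illustrative only.
sqrt_pop <- 10.985 -
  0.009 * 200  -   # duration (assumed units)
  3.161 * 0.8  -   # energy
  1.884 * 0    +   # instrumentalness
  0.275 * 1    +   # Key7 dummy (all other key dummies are 0)
  0.387 * -5   +   # loudness (dB)
  4.579 * 0.05 -   # speechiness
  0.002 * 120  -   # tempo (BPM)
  0.837 * 0.5      # valence
sqrt_pop^2         # back-transform: roughly 20.9 on the 0-100 popularity scale
```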

prediction on test data

# prediction on test data
yhat.step_kcv = predict(mlr.step_kcv$finalModel, newdata=key.dummy(test1) %>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.step_kcv <- yhat.step_kcv - test1$popularity^0.5
rmse.step_kcv <- sqrt(mean(error.step_kcv^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv, test1$popularity^0.5),
  R2 = R2(yhat.step_kcv, test1$popularity^0.5)
)
##       RMSE        R2
## 1 1.858518 0.1257011

Regularized Regression: Ridge

lm.ridge1 <- train(popularity ~. , data = train1 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge1$bestTune
##    alpha lambda
## 45     0  0.045
# best coefficient
lm.ridge1.model <- coef(lm.ridge1$finalModel, lm.ridge1$bestTune$lambda)
lm.ridge1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      10.347839763
## duration         -0.009063271
## acousticness      0.169898922
## danceability      0.049633951
## energy           -2.620990536
## instrumentalness -1.927744817
## key1              0.293618095
## key2              0.182685249
## key3              0.399431145
## key4             -0.049409156
## key5              0.179350934
## key6              0.188771865
## key7              0.259698734
## key8              0.273385099
## key9              0.249069582
## key10             0.051821098
## key11             0.361138772
## loudness          0.355806469
## speechiness       4.325244621
## tempo            -0.001381351
## valence          -0.858954968

The Ridge Regularized Regression model can be written:
\[ \hat {\sqrt {popularity}} = 10.348 - 0.009{Duration} + 0.170{Acousticness} + 0.050{Danceability} - 2.621{Energy} - 1.928{Instrumentalness} \\ + 0.294{Key1} + 0.183{Key2} + 0.399{Key3} - 0.049{Key4} + 0.179{Key5} + 0.189{Key6} + 0.260{Key7} \\ + 0.273{Key8} + 0.249{Key9} + 0.052{Key10} + 0.361{Key11} + 0.356{Loudness} + 4.325{Speechiness} \\ -0.001{Tempo} - 0.859{Valence} \]

An increase in Duration, Energy, Instrumentalness, Tempo, or Valence, or choosing Key4 (E Major) instead of Key0 (C Major), decreases the square root popularity score, on average. An increase in Acousticness, Danceability, Loudness, or Speechiness, or choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), Key10 (A#/B-flat Major), or Key11 (B Major) over Key0 (C Major), increases the square root popularity score, on average.

# prediction on test data
yhat.ridge1 = predict(lm.ridge1, s=lm.ridge1$bestTune,newdata=test1%>% mutate(popularity = popularity^0.5))
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge1, test1$popularity^0.5),
  R2 = R2(yhat.ridge1, test1$popularity^0.5)
)
##       RMSE        R2
## 1 1.858182 0.1253763

Regularized Regression: Lasso

lm.lasso1 <- train(popularity ~. , data = train1 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso1$bestTune
##   alpha lambda
## 3     1  0.003
# best coefficient
lm.lasso1.model <- coef(lm.lasso1$finalModel, lm.lasso1$bestTune$lambda)
lm.lasso1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      10.823106069
## duration         -0.009165071
## acousticness      0.078797866
## danceability      0.007432375
## energy           -3.012459438
## instrumentalness -1.855427319
## key1              0.268514415
## key2              0.153367880
## key3              0.368124062
## key4             -0.060812240
## key5              0.148197878
## key6              0.160460653
## key7              0.231766196
## key8              0.242516531
## key9              0.215811156
## key10             0.020377237
## key11             0.334220735
## loudness          0.381054835
## speechiness       4.504126435
## tempo            -0.001394241
## valence          -0.829567095

The Lasso Multiple Linear Regression model can be written: \[ \hat {\sqrt {popularity}} = 10.823 - 0.009{Duration} + 0.079{Acousticness} + 0.007{Danceability} - 3.012{Energy} - 1.855{Instrumentalness} \\ + 0.269{Key1} + 0.153{Key2} + 0.368{Key3} - 0.061{Key4} + 0.148{Key5} + 0.160{Key6} + 0.232{Key7} \\ + 0.243{Key8} + 0.216{Key9} + 0.020{Key10} + 0.334{Key11} + 0.381{Loudness} + 4.504{Speechiness} \\ -0.001{Tempo} - 0.830{Valence} \]

Compared to the full multiple linear regression model, all variables are kept in the model with lasso, but the coefficients are shrunk closer to zero.
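The shrinkage can be inspected directly by placing the two coefficient vectors side by side. The object name `mlr.full1` below is a stand-in for however the full multiple linear regression fit from the earlier section is stored, so adjust it to the actual name:

```r
# Illustrative shrinkage comparison (object name `mlr.full1` is assumed,
# not taken from this document; substitute the real full-model object).
comparison <- data.frame(
  OLS   = coef(mlr.full1$finalModel),
  Lasso = as.numeric(lm.lasso1.model)
)
# Lasso estimates sit closer to zero than the corresponding OLS estimates.
round(comparison, 3)
```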

An increase in Duration, Energy, Instrumentalness, Tempo, or Valence, or choosing Key4 (E Major) instead of Key0 (C Major), decreases the square root popularity score, on average. An increase in Acousticness, Danceability, Loudness, or Speechiness, or choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), Key10 (A#/B-flat Major), or Key11 (B Major) over Key0 (C Major), increases the square root popularity score, on average.

# prediction on test data
yhat.lasso1 = predict(lm.lasso1, s=lm.lasso1$bestTune,newdata=test1%>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.lasso1 <- yhat.lasso1 - test1$popularity^0.5
rmse.lasso1 <- sqrt(mean(error.lasso1^2))
data.frame(
  RMSE = RMSE(yhat.lasso1, test1$popularity^0.5),
  R2 = R2(yhat.lasso1, test1$popularity^0.5)
)
##       RMSE        R2
## 1 1.858393 0.1255584

Regularized Regression: Elastic Net

lm.enet1 <- train(popularity ~. , data = train1 %>% mutate(popularity = popularity^0.5), method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lm.enet1$bestTune
##     alpha lambda
## 507  0.25  0.007
# best coefficient
lm.enet1.model <- coef(lm.enet1$finalModel, lm.enet1$bestTune$lambda)
lm.enet1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      10.770793937
## duration         -0.009180508
## acousticness      0.092081531
## danceability      0.026587503
## energy           -2.978939887
## instrumentalness -1.872237837
## key1              0.287645952
## key2              0.173244638
## key3              0.390821662
## key4             -0.049747724
## key5              0.169007388
## key6              0.181032607
## key7              0.250833300
## key8              0.262900268
## key9              0.236604652
## key10             0.042176217
## key11             0.354397091
## loudness          0.379348713
## speechiness       4.503674675
## tempo            -0.001415713
## valence          -0.840584039

The Elastic Net Multiple Linear Regression model can be written: \[ \hat {\sqrt {popularity}} = 10.771 - 0.009{Duration} + 0.092{Acousticness} + 0.027{Danceability} - 2.979{Energy} - 1.872{Instrumentalness} \\ + 0.288{Key1} + 0.173{Key2} + 0.391{Key3} - 0.050{Key4} + 0.169{Key5} + 0.181{Key6} + 0.251{Key7} \\ + 0.263{Key8} + 0.237{Key9} + 0.042{Key10} + 0.354{Key11} + 0.379{Loudness} + 4.504{Speechiness} \\ -0.001{Tempo} - 0.841{Valence} \]

An increase in Duration, Energy, Instrumentalness, Tempo, or Valence, or choosing Key4 (E Major) instead of Key0 (C Major), decreases the square root popularity score, on average. An increase in Acousticness, Danceability, Loudness, or Speechiness, or choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), Key10 (A#/B-flat Major), or Key11 (B Major) over Key0 (C Major), increases the square root popularity score, on average.

# prediction on test data
yhat.enet1 = predict(lm.enet1, s=lm.enet1$bestTune,newdata=test1%>% mutate(popularity = popularity^0.5))
# RMSE for test data
error.enet1 <- yhat.enet1 - test1$popularity^0.5
rmse.enet1 <- sqrt(mean(error.enet1^2))
data.frame(
  RMSE = RMSE(yhat.enet1, test1$popularity^0.5),
  R2 = R2(yhat.enet1, test1$popularity^0.5)
)
##       RMSE        R2
## 1 1.858296 0.1256477

Deciding the best MLR model for mode 1 songs

mlr1.results <- data.frame(Model = c("Full", "Step 10CV", "Ridge 10CV", "Lasso 10CV", "ElasticNet 10CV"),
                           #AIC = c(),
                           adjR2 = c(0.1468, 0.1453, NA, NA, NA),
                           RMSE = c(1.8585, 1.8585, 1.8582, 1.8584, 1.8583),
                           Pred.R2 = c(0.1258, 0.1257, 0.1254, 0.1256, 0.1256))
mlr1.results
##             Model  adjR2   RMSE Pred.R2
## 1            Full 0.1468 1.8585  0.1258
## 2       Step 10CV 0.1453 1.8585  0.1257
## 3      Ridge 10CV     NA 1.8582  0.1254
## 4      Lasso 10CV     NA 1.8584  0.1256
## 5 ElasticNet 10CV     NA 1.8583  0.1256

The five models perform nearly identically on the test data: Ridge has the lowest RMSE (1.8582), while the full model has the highest adjusted R2 (0.1468) and predictive R2 (0.1258).

Logistic (classification approach)

Split the data into 70% train, 15% validation, and 15% test; the validation set is used to find the optimal cutoff value.
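The `partition.3()` helper is defined in `thesis_functions.R` (not shown here). A minimal sketch with the same interface, under the assumption that it simply shuffles rows and carves off the three proportions, might look like:

```r
# Assumed reimplementation of partition.3() from thesis_functions.R
# (named .sketch to avoid shadowing the real helper): shuffle the row
# indices, then take 70% train, 15% validation, and the remaining 15% test.
partition.3.sketch <- function(df, prop.train, prop.val) {
  n <- nrow(df)
  idx <- sample(seq_len(n))                  # random permutation of rows
  n.train <- floor(prop.train * n)
  n.val   <- floor(prop.val * n)
  list(data.train = df[idx[1:n.train], ],
       data.val   = df[idx[(n.train + 1):(n.train + n.val)], ],
       data.test  = df[idx[(n.train + n.val + 1):n], ])
}
```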

kpop.logit <- select(data, popular, duration, acousticness, danceability, energy, instrumentalness, key, loudness, mode, speechiness, tempo, valence)
logit.kpop0 <- kpop.logit %>% filter(mode == 0)%>% select(-mode)
logit.kpop1 <- kpop.logit %>% filter(mode == 1) %>% select(-mode)

### Kpop mode 0 train and test
# smpl.size0 <- floor(0.75*nrow(logit.kpop0))
# set.seed(123)
# smpl0 <- sample(nrow(logit.kpop0), smpl.size0, replace = FALSE)
# og.logit.train0 <- logit.kpop0[smpl0,]
# og.logit.test0 <- logit.kpop0[-smpl0,]
set.seed(123)
p3 <- partition.3(logit.kpop0, 0.70, 0.15)
logit.train0 <- p3$data.train
logit.valid0 <- p3$data.val
logit.test0 <- p3$data.test
all.train0 <- rbind(logit.train0, logit.valid0)

# ### Kpop mode 1 train and test
# smpl.size1 <- floor(0.75*nrow(logit.kpop1))
# set.seed(123)
# smpl1 <- sample(nrow(logit.kpop1), smpl.size1, replace = FALSE)
# logit.train1 <- logit.kpop1[smpl1,]
# logit.test1 <- logit.kpop1[-smpl1,]
set.seed(123)
p3 <- partition.3(logit.kpop1, 0.70, 0.15)
logit.train1 <- p3$data.train
logit.valid1 <- p3$data.val
logit.test1 <- p3$data.test
all.train1 <- rbind(logit.train1, logit.valid1)

Mode 0 popularity

Full Logistic Model: train/valid/test

Fit the logistic model on the training data and use the validation set to find the optimal cutoff; the final model is then refit on the combined train and validation data and evaluated on the test set.

# Fit logistic model on training data
v.logit.model0 <- glm(popular ~ ., family=binomial(link='logit'),data= logit.train0)

#search for best cutoff
out0 <- opt.cut.func(v.logit.model0, logit.valid0)
opt.cut.plot(out0)

out0$cutoff[which.min(out0$ssdiff.vec)]
## [1] 0.14
v.opt.cut0 <- out0$cutoff[which.max(out0$kappa.vec)]
v.opt.cut0
## [1] 0.18
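`opt.cut.func()` is also defined in `thesis_functions.R`. An assumed reimplementation, consistent with how `cutoff`, `kappa.vec`, and `ssdiff.vec` are used above, would sweep candidate cutoffs on the validation set and record Cohen's kappa and the sensitivity-specificity gap at each:

```r
library(caret)  # for confusionMatrix()

# Sketch of the cutoff search (the real opt.cut.func() may differ):
# for each cutoff, classify the validation set and record kappa and
# |sensitivity - specificity|.
opt.cut.func.sketch <- function(model, valid) {
  cutoff <- seq(0.01, 0.99, by = 0.01)
  prob <- predict(model, newdata = valid, type = "response")
  kappa.vec <- ssdiff.vec <- numeric(length(cutoff))
  for (i in seq_along(cutoff)) {
    pred <- factor(ifelse(prob > cutoff[i], 1, 0), levels = c(0, 1))
    cm <- confusionMatrix(pred, valid$popular, positive = "1")
    kappa.vec[i]  <- cm$overall["Kappa"]
    ssdiff.vec[i] <- abs(cm$byClass["Sensitivity"] - cm$byClass["Specificity"])
  }
  list(cutoff = cutoff, kappa.vec = kappa.vec, ssdiff.vec = ssdiff.vec)
}
```

Maximizing `kappa.vec` gives the kappa-optimal cutoff, while minimizing `ssdiff.vec` balances sensitivity against specificity.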

Fit final model (combo of train and validation)

v.model.final <-  glm(popular ~ ., data=all.train0, family=binomial(link='logit'))
summary(v.model.final)
## 
## Call:
## glm(formula = popular ~ ., family = binomial(link = "logit"), 
##     data = all.train0)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4700  -0.5782  -0.4585  -0.3339   2.6461  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       5.0594524  0.8743667   5.786 7.19e-09 ***
## duration         -0.0086620  0.0016174  -5.355 8.54e-08 ***
## acousticness     -1.4718150  0.3559297  -4.135 3.55e-05 ***
## danceability      0.3111716  0.5329309   0.584   0.5593    
## energy           -3.7232087  0.5626631  -6.617 3.66e-11 ***
## instrumentalness -2.8085598  1.8548502  -1.514   0.1300    
## key1              0.1109165  0.2538918   0.437   0.6622    
## key2             -0.0875524  0.3211703  -0.273   0.7852    
## key3              0.7722710  0.3100048   2.491   0.0127 *  
## key4             -0.3270021  0.2657016  -1.231   0.2184    
## key5              0.1797736  0.2418359   0.743   0.4573    
## key6              0.3135429  0.2457043   1.276   0.2019    
## key7              0.1753454  0.2759348   0.635   0.5251    
## key8              0.2184072  0.2954494   0.739   0.4598    
## key9              0.0318705  0.2590589   0.123   0.9021    
## key10             0.1516816  0.2613415   0.580   0.5616    
## key11             0.0590538  0.2393927   0.247   0.8052    
## loudness          0.3497156  0.0428642   8.159 3.39e-16 ***
## speechiness       3.4530879  0.6920940   4.989 6.06e-07 ***
## tempo            -0.0008802  0.0020104  -0.438   0.6615    
## valence          -1.6385761  0.2894790  -5.660 1.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3102.1  on 3971  degrees of freedom
## Residual deviance: 2903.6  on 3951  degrees of freedom
## AIC: 2945.6
## 
## Number of Fisher Scoring iterations: 6

Full Logistic Model \[ log(\frac{\hat \pi}{1 - \hat \pi}) = 5.059 - 0.009{Duration} - 1.472{Acousticness} + 0.311{Danceability} - 3.723{Energy} -2.809{Instrumentalness} \\ + 0.111{Key1} - 0.088{Key2} + 0.772{Key3} - 0.327{Key4} + 0.180{Key5} + 0.314{Key6} + 0.175{Key7} + 0.218{Key8} + 0.032{Key9} \\ + 0.152{Key10} + 0.059{Key11} + 0.350{Loudness} + 3.453{Speechiness} - 0.001{Tempo} -1.639{Valence} \]

Significant variables are Duration, Acousticness, Energy, Loudness, Speechiness, and Valence, as well as having Key3 (D#/E-flat Minor) instead of Key0 (C Minor).
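Because these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. Using the fitted model object above:

```r
# Odds ratios for a few significant predictors: values above 1 increase
# the odds of a song being popular, values below 1 decrease them.
round(exp(coef(v.model.final)[c("loudness", "speechiness", "valence")]), 2)
# e.g. exp(0.350) is about 1.42: each 1 dB increase in loudness multiplies
# the odds of popularity by roughly 1.42, holding other predictors fixed.
```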

predict on test using 0.5 cutoff

v.prob.test <- predict(v.logit.model0, newdata = logit.test0, type = "response")
v.pred.test <- ifelse(v.prob.test > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(v.pred.test), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 606  93
##          1   1   1
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0.0153          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.010638        
##             Specificity : 0.998353        
##          Pos Pred Value : 0.500000        
##          Neg Pred Value : 0.866953        
##              Prevalence : 0.134094        
##          Detection Rate : 0.001427        
##    Detection Prevalence : 0.002853        
##       Balanced Accuracy : 0.504495        
##                                           
##        'Positive' Class : 1               
## 

predict on optimal cutoff (0.18) for highest kappa

v.prob.test <- predict(v.logit.model0, newdata = logit.test0, type = "response")
v.pred.test <- ifelse(v.prob.test > v.opt.cut0, 1, 0) # using cutoff = 0.18
confusionMatrix(as.factor(v.pred.test), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 482  67
##          1 125  27
##                                           
##                Accuracy : 0.7261          
##                  95% CI : (0.6915, 0.7588)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0645          
##                                           
##  Mcnemar's Test P-Value : 3.895e-05       
##                                           
##             Sensitivity : 0.28723         
##             Specificity : 0.79407         
##          Pos Pred Value : 0.17763         
##          Neg Pred Value : 0.87796         
##              Prevalence : 0.13409         
##          Detection Rate : 0.03852         
##    Detection Prevalence : 0.21683         
##       Balanced Accuracy : 0.54065         
##                                           
##        'Positive' Class : 1               
## 

Predict on optimal balance between sensitivity and specificity (0.14)

v.prob.test <- predict(v.logit.model0, newdata = logit.test0, type = "response")
v.pred.test <- ifelse(v.prob.test > 0.14, 1, 0) # using cutoff = 0.14
confusionMatrix(as.factor(v.pred.test), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 383  42
##          1 224  52
##                                           
##                Accuracy : 0.6205          
##                  95% CI : (0.5835, 0.6566)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1013          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.55319         
##             Specificity : 0.63097         
##          Pos Pred Value : 0.18841         
##          Neg Pred Value : 0.90118         
##              Prevalence : 0.13409         
##          Detection Rate : 0.07418         
##    Detection Prevalence : 0.39372         
##       Balanced Accuracy : 0.59208         
##                                           
##        'Positive' Class : 1               
## 

Variable Selection: Stepwise (10 fold CV)

stepwise 10-fold cross-validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 

# Fit K-fold CV model  
meh <- capture.output(step0_kcv <- train(popular ~ ., data = logit.train0, family = binomial(), 
                 method = "glmStepAIC", trControl = train_control, verbose = FALSE))
step0_kcv <- step0_kcv$finalModel
step0_kcv
## 
## Call:  NULL
## 
## Coefficients:
##      (Intercept)          duration      acousticness            energy  
##         5.200722         -0.008548         -1.448020         -3.736311  
## instrumentalness              key3              key4          loudness  
##        -2.393322          0.732785         -0.373695          0.339930  
##      speechiness           valence  
##         3.171613         -1.494016  
## 
## Degrees of Freedom: 3270 Total (i.e. Null);  3261 Residual
## Null Deviance:       2609 
## Residual Deviance: 2453  AIC: 2473
summary(step0_kcv)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3942  -0.5881  -0.4721  -0.3445   2.5732  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       5.200722   0.738641   7.041 1.91e-12 ***
## duration         -0.008548   0.001760  -4.858 1.19e-06 ***
## acousticness     -1.448020   0.374716  -3.864 0.000111 ***
## energy           -3.736311   0.600575  -6.221 4.93e-10 ***
## instrumentalness -2.393322   1.819973  -1.315 0.188499    
## key3              0.732785   0.253987   2.885 0.003913 ** 
## key4             -0.373695   0.190967  -1.957 0.050365 .  
## loudness          0.339930   0.045786   7.424 1.13e-13 ***
## speechiness       3.171613   0.744682   4.259 2.05e-05 ***
## valence          -1.494016   0.279311  -5.349 8.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2609.2  on 3270  degrees of freedom
## Residual deviance: 2452.9  on 3261  degrees of freedom
## AIC: 2472.9
## 
## Number of Fisher Scoring iterations: 6
kcv.prob.test0 <- predict(step0_kcv, newdata = key.dummy(logit.test0), type = "response")
kcv.pred.test0 <- ifelse(kcv.prob.test0 > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(kcv.pred.test0), as.factor(logit.test0$popular), positive = "1")

Find optimal cutoff value:

#search for best cutoff
kcv.out0 <- opt.cut.func(step0_kcv, key.dummy(logit.valid0))
opt.cut.plot(kcv.out0)

kcv.out0$cutoff[which.min(kcv.out0$ssdiff.vec)]
## [1] 0.14
kcv.out0.cut0 <- kcv.out0$cutoff[which.max(kcv.out0$kappa.vec)]
kcv.out0.cut0
## [1] 0.17

Fit final model (combo of train and validation)

set.seed(123)
finalmeh <- capture.output(step0_kcv.final0 <- train(popular ~ ., data = all.train0, family = binomial(), 
                 method = "glmStepAIC", trControl = train_control, verbose = FALSE))
step0_kcv.final0 <- step0_kcv.final0$finalModel
step0_kcv.final0
## 
## Call:  NULL
## 
## Coefficients:
##      (Intercept)          duration      acousticness            energy  
##          5.37671          -0.00876          -1.52562          -3.86033  
## instrumentalness              key3              key4          loudness  
##         -2.73902           0.63876          -0.44345           0.35180  
##      speechiness           valence  
##          3.32807          -1.52259  
## 
## Degrees of Freedom: 3971 Total (i.e. Null);  3962 Residual
## Null Deviance:       3102 
## Residual Deviance: 2908  AIC: 2928
summary(step0_kcv.final0)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4190  -0.5794  -0.4622  -0.3346   2.6346  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       5.376709   0.682435   7.879 3.31e-15 ***
## duration         -0.008760   0.001591  -5.506 3.67e-08 ***
## acousticness     -1.525616   0.345842  -4.411 1.03e-05 ***
## energy           -3.860331   0.556026  -6.943 3.85e-12 ***
## instrumentalness -2.739017   1.838644  -1.490  0.13630    
## key3              0.638761   0.240138   2.660  0.00781 ** 
## key4             -0.443445   0.179981  -2.464  0.01375 *  
## loudness          0.351801   0.042604   8.258  < 2e-16 ***
## speechiness       3.328066   0.678889   4.902 9.48e-07 ***
## valence          -1.522593   0.257624  -5.910 3.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3102.1  on 3971  degrees of freedom
## Residual deviance: 2908.3  on 3962  degrees of freedom
## AIC: 2928.3
## 
## Number of Fisher Scoring iterations: 6

Stepwise 10 Fold Cross Validation:

Compared to the full logistic regression model, the stepwise model leaves out Danceability, Key1, Key2, Key5 through Key11, and Tempo.

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 5.377 - 0.009{Duration} - 1.526{Acousticness} - 3.860{Energy} - 2.739{Instrumentalness} \\ + 0.639{Key3} - 0.443{Key4} + 0.352{Loudness} + 3.328{Speechiness} - 1.523{Valence} \]

The significant variables in the model are Duration, Acousticness, Energy, Key3 (D#/E-flat Minor), Key4 (E Minor), Loudness, Speechiness, and Valence.

predict on test using 0.5 cutoff

kcv.prob.test0 <- predict(step0_kcv.final0, newdata = key.dummy(logit.test0), type = "response")
kcv.pred.test0 <- ifelse(kcv.prob.test0 > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(kcv.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 606  93
##          1   1   1
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0.0153          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.010638        
##             Specificity : 0.998353        
##          Pos Pred Value : 0.500000        
##          Neg Pred Value : 0.866953        
##              Prevalence : 0.134094        
##          Detection Rate : 0.001427        
##    Detection Prevalence : 0.002853        
##       Balanced Accuracy : 0.504495        
##                                           
##        'Positive' Class : 1               
## 

predict on test using optimal kappa, 0.17 cutoff

kcv.prob.test0 <- predict(step0_kcv.final0, newdata = key.dummy(logit.test0), type = "response")
kcv.pred.test0 <- ifelse(kcv.prob.test0 > 0.17, 1, 0) # using cutoff = 0.17
confusionMatrix(as.factor(kcv.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 475  65
##          1 132  29
##                                          
##                Accuracy : 0.719          
##                  95% CI : (0.6841, 0.752)
##     No Information Rate : 0.8659         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.07           
##                                          
##  Mcnemar's Test P-Value : 2.572e-06      
##                                          
##             Sensitivity : 0.30851        
##             Specificity : 0.78254        
##          Pos Pred Value : 0.18012        
##          Neg Pred Value : 0.87963        
##              Prevalence : 0.13409        
##          Detection Rate : 0.04137        
##    Detection Prevalence : 0.22967        
##       Balanced Accuracy : 0.54552        
##                                          
##        'Positive' Class : 1              
## 

predict on test using 0.14 cutoff, optimal balance between sensitivity and specificity

kcv.prob.test0 <- predict(step0_kcv.final0, newdata = key.dummy(logit.test0), type = "response")
kcv.pred.test0 <- ifelse(kcv.prob.test0 > 0.14, 1, 0) # using cutoff = 0.14
confusionMatrix(as.factor(kcv.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 397  45
##          1 210  49
##                                           
##                Accuracy : 0.6362          
##                  95% CI : (0.5994, 0.6719)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1007          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5213          
##             Specificity : 0.6540          
##          Pos Pred Value : 0.1892          
##          Neg Pred Value : 0.8982          
##              Prevalence : 0.1341          
##          Detection Rate : 0.0699          
##    Detection Prevalence : 0.3695          
##       Balanced Accuracy : 0.5877          
##                                           
##        'Positive' Class : 1               
## 

Variable Selection: All possible regression

set.seed(123)
glmulti.out0 <- glmulti(popular ~ ., data = logit.train0,
                           level = 1,               # No interaction considered
            method = "h",            # Exhaustive approach
            crit = "aic",            # AIC as criteria
            confsetsize = 5,         # Keep 5 best models
            plotty = F, report = F,  # No plot or interim reports
            fitfunction = "glm",     # glm function
            family = binomial)       # binomial to run logistic regression 
#glmulti.out0@formulas

view summary of top model

summary(glmulti.out0@objects[[1]])
## 
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3900  -0.5880  -0.4766  -0.3530   2.4475  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       5.222133   0.734876   7.106 1.19e-12 ***
## duration         -0.008472   0.001757  -4.822 1.42e-06 ***
## acousticness     -1.456307   0.374863  -3.885 0.000102 ***
## energy           -3.820021   0.597550  -6.393 1.63e-10 ***
## instrumentalness -2.385559   1.826735  -1.306 0.191582    
## loudness          0.343749   0.045699   7.522 5.39e-14 ***
## speechiness       3.090051   0.744032   4.153 3.28e-05 ***
## valence          -1.418107   0.277458  -5.111 3.20e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2609.2  on 3270  degrees of freedom
## Residual deviance: 2465.1  on 3263  degrees of freedom
## AIC: 2481.1
## 
## Number of Fisher Scoring iterations: 6

Store model

allreg.logit0 <- glmulti.out0@objects[[1]]
#search for best cutoff
allreg.out0 <- opt.cut.func(allreg.logit0, logit.valid0)
opt.cut.plot(allreg.out0)

allreg.out0$cutoff[which.min(allreg.out0$ssdiff.vec)]
## [1] 0.15
allreg.out0.cut0 <- allreg.out0$cutoff[which.max(allreg.out0$kappa.vec)]
allreg.out0.cut0
## [1] 0.14

fit final model to combo of training and validation

set.seed(123)
glmulti.out0 <- glmulti(popular ~ ., data = all.train0,
                           level = 1,               # No interaction considered
            method = "h",            # Exhaustive approach
            crit = "aic",            # AIC as criteria
            confsetsize = 5,         # Keep 5 best models
            plotty = F, report = F,  # No plot or interim reports
            fitfunction = "glm",     # glm function
            family = binomial)       # binomial to run logistic regression
#glmulti.out0@formulas
summary(glmulti.out0@objects[[1]])
## 
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4130  -0.5781  -0.4653  -0.3432   2.4901  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       5.369288   0.678471   7.914 2.50e-15 ***
## duration         -0.008593   0.001584  -5.426 5.75e-08 ***
## acousticness     -1.537242   0.345553  -4.449 8.64e-06 ***
## energy           -3.954820   0.552814  -7.154 8.43e-13 ***
## instrumentalness -2.722274   1.849908  -1.472    0.141    
## loudness          0.354367   0.042542   8.330  < 2e-16 ***
## speechiness       3.278056   0.677868   4.836 1.33e-06 ***
## valence          -1.444874   0.255848  -5.647 1.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3102.1  on 3971  degrees of freedom
## Residual deviance: 2922.2  on 3964  degrees of freedom
## AIC: 2938.2
## 
## Number of Fisher Scoring iterations: 6

All Possible Regression:

Compared to the full logistic model, the model from all-subsets regression leaves out Danceability, all Key dummy variables, and Tempo.

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 5.369 - 0.009{Duration} - 1.537{Acousticness} - 3.955{Energy} - 2.722{Instrumentalness} \\ + 0.354{Loudness} + 3.278{Speechiness} - 1.445{Valence} \]

store final model

allreg.logit0.final <- glmulti.out0@objects[[1]]

Predictions with 0.5 as the cutoff

allreg.prob.test0 <- predict(allreg.logit0.final, newdata = logit.test0, type = "response")
allreg.pred.test0 <- ifelse(allreg.prob.test0 > 0.5, 1, 0) # using standard cutoff 0.5
confusionMatrix(as.factor(allreg.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 606  93
##          1   1   1
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0.0153          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.010638        
##             Specificity : 0.998353        
##          Pos Pred Value : 0.500000        
##          Neg Pred Value : 0.866953        
##              Prevalence : 0.134094        
##          Detection Rate : 0.001427        
##    Detection Prevalence : 0.002853        
##       Balanced Accuracy : 0.504495        
##                                           
##        'Positive' Class : 1               
## 
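At the default 0.5 cutoff, accuracy only matches the No Information Rate, which is simply the majority-class share of the test set. A quick check from the confusion-matrix margins:

```r
# No Information Rate = majority-class proportion of the test set:
# predicting "not popular" for every song already achieves this accuracy.
nir <- (606 + 1) / (606 + 1 + 93 + 1)   # 607 non-popular out of 701 test songs
round(nir, 4)   # 0.8659, identical to the model's accuracy at cutoff 0.5
```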

Predictions with the cutoff that yields the best kappa, 0.14

allreg.prob.test0 <- predict(allreg.logit0.final, newdata = logit.test0, type = "response")
allreg.pred.test0 <- ifelse(allreg.prob.test0 > allreg.out0.cut0, 1, 0) # using cutoff 0.14
confusionMatrix(as.factor(allreg.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 404  42
##          1 203  52
##                                           
##                Accuracy : 0.6505          
##                  95% CI : (0.6139, 0.6858)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1269          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.55319         
##             Specificity : 0.66557         
##          Pos Pred Value : 0.20392         
##          Neg Pred Value : 0.90583         
##              Prevalence : 0.13409         
##          Detection Rate : 0.07418         
##    Detection Prevalence : 0.36377         
##       Balanced Accuracy : 0.60938         
##                                           
##        'Positive' Class : 1               
## 

Predictions with the cutoff that best balances sensitivity and specificity, 0.15

allreg.prob.test0 <- predict(allreg.logit0.final, newdata = logit.test0, type = "response")
allreg.pred.test0 <- ifelse(allreg.prob.test0 > 0.15, 1, 0) # using cutoff 0.15
confusionMatrix(as.factor(allreg.pred.test0), as.factor(logit.test0$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 435  51
##          1 172  43
##                                          
##                Accuracy : 0.6819         
##                  95% CI : (0.646, 0.7162)
##     No Information Rate : 0.8659         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1128         
##                                          
##  Mcnemar's Test P-Value : 9.297e-16      
##                                          
##             Sensitivity : 0.45745        
##             Specificity : 0.71664        
##          Pos Pred Value : 0.20000        
##          Neg Pred Value : 0.89506        
##              Prevalence : 0.13409        
##          Detection Rate : 0.06134        
##    Detection Prevalence : 0.30670        
##       Balanced Accuracy : 0.58704        
##                                          
##        'Positive' Class : 1              
## 

Regularized Regression: Ridge 10-fold Cross-Validation

set.seed(123)
ridge0 <- train(popular ~ ., data = logit.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#best tuning parameter
ridge0$bestTune
##     alpha lambda
## 100     0    0.1
ridge0.model <- coef(ridge0$finalModel, ridge0$bestTune$lambda)
ridge0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       0.3812536494
## duration         -0.0042651992
## acousticness     -0.3997989337
## danceability     -0.1441080290
## energy           -0.6313909327
## instrumentalness -0.8210624307
## key1             -0.0140152818
## key2             -0.1557323229
## key3              0.4654206935
## key4             -0.1797053979
## key5              0.0291019833
## key6              0.0724378741
## key7              0.0623824309
## key8             -0.0095258891
## key9              0.0216445665
## key10             0.0301392242
## key11            -0.0311864381
## loudness          0.0852587410
## speechiness       1.4727602887
## tempo             0.0001278062
## valence          -0.7066431211
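Note that `ridge0$bestTune` lands on the upper edge of the lambda grid (0.1), which usually means the grid is truncating the search. A wider, log-spaced grid would confirm whether even more shrinkage is preferred — a sketch, with suggested grid values (not the original tuning specification):

```r
# Log-spaced lambda grid spanning several orders of magnitude, so the
# selected penalty can fall in the interior of the grid rather than at
# its boundary. (Grid values here are a suggestion.)
wide.grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 1, length.out = 50))
range(wide.grid$lambda)   # 0.001 to 10
# set.seed(123)
# ridge0.wide <- train(popular ~ ., data = logit.train0, method = "glmnet",
#                      family = "binomial", trControl = train_control,
#                      tuneGrid = wide.grid)
```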

Search for best cutoff using validation set

ridge0.out <- reg.opt.cut.func(ridge0, logit.valid0)
opt.cut.plot(ridge0.out)

# cut off by kappa
ridge0.out$cutoff[which.max(ridge0.out$kappa.vec)]
## [1] 0.15
ridge0.out$cutoff[which.min(ridge0.out$ssdiff.vec)]
## [1] 0.14
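`reg.opt.cut.func()` is defined in `thesis_functions.R` (not shown here); conceptually, the cutoff search sweeps a grid of thresholds on the validation set and records kappa and the sensitivity-specificity gap at each one. A minimal sketch of that idea — the function and column names below are illustrative, not the thesis helpers:

```r
library(caret)  # for confusionMatrix()

# Sweep candidate cutoffs on validation data and score each one.
# `probs` are predicted P(popular = 1); `truth` is the observed 0/1 factor.
sweep.cutoffs <- function(probs, truth, cutoffs = seq(0.01, 0.99, by = 0.01)) {
  scores <- t(sapply(cutoffs, function(ct) {
    pred <- factor(ifelse(probs > ct, 1, 0), levels = c(0, 1))
    cm <- confusionMatrix(pred, truth, positive = "1")
    c(kappa  = unname(cm$overall["Kappa"]),
      ssdiff = unname(abs(cm$byClass["Sensitivity"] - cm$byClass["Specificity"])))
  }))
  data.frame(cutoff = cutoffs, scores)
}

# grid <- sweep.cutoffs(predict(ridge0, logit.valid0, type = "prob")[, 2],
#                       logit.valid0$popular)
# grid$cutoff[which.max(grid$kappa)]    # cutoff maximizing kappa
# grid$cutoff[which.min(grid$ssdiff)]   # cutoff balancing sens./spec.
```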

create final model on combined train and validation data

set.seed(123)
ridge0 <- train(popular ~ ., data = all.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#best tuning parameter
ridge0$bestTune
##     alpha lambda
## 100     0    0.1
ridge0.model <- coef(ridge0$finalModel, ridge0$bestTune$lambda)
ridge0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)       0.321762194
## duration         -0.004289330
## acousticness     -0.409439651
## danceability     -0.062528153
## energy           -0.607499947
## instrumentalness -0.839357369
## key1              0.013386505
## key2             -0.091231783
## key3              0.400527208
## key4             -0.198298922
## key5              0.053748073
## key6              0.142443327
## key7              0.036589828
## key8              0.030302380
## key9             -0.041870877
## key10             0.034094167
## key11            -0.034462056
## loudness          0.084161023
## speechiness       1.615890923
## tempo            -0.000348579
## valence          -0.724448093

Ridge 10-fold Cross-Validation:

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 0.322 - 0.004{Duration} - 0.409{Acousticness} - 0.063{Danceability} - 0.607{Energy} - 0.839{Instrumentalness} \\ + 0.013{Key1} - 0.091{Key2} + 0.401{Key3} - 0.198{Key4} + 0.054{Key5} + 0.142{Key6} + 0.037{Key7} + 0.030{Key8} - 0.042{Key9} \\ + 0.034{Key10} - 0.034{Key11} + 0.084{Loudness} + 1.616{Speechiness} - 0.0003{Tempo} - 0.724{Valence} \]
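Ridge coefficients are shrunken log-odds-ratios, so exponentiating them gives (penalized) multiplicative effects on the odds. Keep in mind that speechiness and valence are on a 0–1 scale, so a full one-unit change spans the variable's entire range:

```r
# Exponentiated coefficients give (penalized) odds multipliers.
exp(1.616)    # speechiness: odds multiplied ~5x per one-unit increase
exp(-0.724)   # valence: odds roughly halved per one-unit increase
```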

predict and evaluate on the test set where cutoff is at 0.5

prob.ridge0 <- predict(ridge0, s = ridge0$bestTune, logit.test0, type = "prob")
pred.ridge0 <- ifelse(prob.ridge0[,2] > 0.50, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.ridge0), as.factor(logit.test0$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.ridge0),
## as.factor(logit.test0$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 607  94
##          1   0   0
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8659          
##              Prevalence : 0.1341          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

predict and evaluate on the test set where cutoff is at 0.15 corresponding to optimal kappa

prob.ridge0 <- predict(ridge0, s = ridge0$bestTune, logit.test0, type = "prob")
pred.ridge0 <- ifelse(prob.ridge0[,2] > 0.15, 1, 0) # using cutoff = 0.15
confusionMatrix(as.factor(pred.ridge0), as.factor(logit.test0$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 464  63
##          1 143  31
##                                           
##                Accuracy : 0.7061          
##                  95% CI : (0.6709, 0.7396)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0693          
##                                           
##  Mcnemar's Test P-Value : 3.709e-08       
##                                           
##             Sensitivity : 0.32979         
##             Specificity : 0.76442         
##          Pos Pred Value : 0.17816         
##          Neg Pred Value : 0.88046         
##              Prevalence : 0.13409         
##          Detection Rate : 0.04422         
##    Detection Prevalence : 0.24822         
##       Balanced Accuracy : 0.54710         
##                                           
##        'Positive' Class : 1               
## 

predict and evaluate on the test set where cutoff is at 0.14 corresponding to optimal sensitivity and specificity balance

prob.ridge0 <- predict(ridge0, s = ridge0$bestTune, logit.test0, type = "prob")
pred.ridge0 <- ifelse(prob.ridge0[,2] > 0.14, 1, 0) # using cutoff = 0.14
confusionMatrix(as.factor(pred.ridge0), as.factor(logit.test0$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 403  48
##          1 204  46
##                                           
##                Accuracy : 0.6405          
##                  95% CI : (0.6037, 0.6761)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0901          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.48936         
##             Specificity : 0.66392         
##          Pos Pred Value : 0.18400         
##          Neg Pred Value : 0.89357         
##              Prevalence : 0.13409         
##          Detection Rate : 0.06562         
##    Detection Prevalence : 0.35663         
##       Balanced Accuracy : 0.57664         
##                                           
##        'Positive' Class : 1               
## 

Regularized Regression: Lasso 10-fold Cross-Validation (returns an intercept-only model)

set.seed(123)
lasso0 <- train(popular ~ ., data = logit.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lasso0$bestTune
##     alpha lambda
## 100     1    0.1
# best coefficient
lasso0.model <- coef(lasso0$finalModel, lasso0$bestTune$lambda)
lasso0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)      -1.843351
## duration          .       
## acousticness      .       
## danceability      .       
## energy            .       
## instrumentalness  .       
## key1              .       
## key2              .       
## key3              .       
## key4              .       
## key5              .       
## key6              .       
## key7              .       
## key8              .       
## key9              .       
## key10             .       
## key11             .       
## loudness          .       
## speechiness       .       
## tempo             .       
## valence           .
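An intercept-only logistic model predicts the same probability for every song — the inverse logit of the intercept — which recovers roughly the overall share of popular tracks (about 13–14% in this data):

```r
# An intercept-only logistic model predicts one constant probability:
plogis(-1.843351)   # ~0.137, roughly the training-set share of popular songs
```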

Search for best cutoff using validation set

lasso0.out <- reg.opt.cut.func(lasso0, logit.valid0)
opt.cut.plot(lasso0.out)

# cut off by kappa
lasso0.out$cutoff[which.max(lasso0.out$kappa.vec)]
## [1] 0
lasso0.out$cutoff[which.min(lasso0.out$ssdiff.vec)]
## [1] 0
create final model on train and validation

set.seed(123)
lasso0.final <- train(popular ~ ., data = all.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lasso0.final$bestTune
##     alpha lambda
## 100     1    0.1
# best coefficient
lasso0.model.final <- coef(lasso0.final$finalModel, lasso0.final$bestTune$lambda)
lasso0.model.final
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)      -1.881861
## duration          .       
## acousticness      .       
## danceability      .       
## energy            .       
## instrumentalness  .       
## key1              .       
## key2              .       
## key3              .       
## key4              .       
## key5              .       
## key6              .       
## key7              .       
## key8              .       
## key9              .       
## key10             .       
## key11             .       
## loudness          .       
## speechiness       .       
## tempo             .       
## valence           .

predict and evaluate on the test set where cutoff is at 0.5

prob.lasso0 <- predict(lasso0.final, s = lasso0.final$bestTune, logit.test0, type = "prob")
pred.lasso0 <- ifelse(prob.lasso0[,2] > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.lasso0), as.factor(logit.test0$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.lasso0),
## as.factor(logit.test0$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 607  94
##          1   0   0
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8659          
##              Prevalence : 0.1341          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

predict and evaluate on the test set where cutoff is at 0.11

prob.lasso0 <- predict(lasso0.final, s = lasso0.final$bestTune, logit.test0, type = "prob")
pred.lasso0 <- ifelse(prob.lasso0[,2] > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(pred.lasso0), as.factor(logit.test0$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.lasso0),
## as.factor(logit.test0$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   0   0
##          1 607  94
##                                           
##                Accuracy : 0.1341          
##                  95% CI : (0.1097, 0.1616)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.1341          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.1341          
##          Detection Rate : 0.1341          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

Regularized Regression: Elastic Net 10-fold Cross-Validation

set.seed(123)
enet0 <- train(popular ~ ., data = logit.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
enet0$bestTune
##     alpha lambda
## 100     0    0.1
# best coefficient
enet0.model <- coef(enet0$finalModel, enet0$bestTune$lambda)
enet0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       0.3812536494
## duration         -0.0042651992
## acousticness     -0.3997989337
## danceability     -0.1441080290
## energy           -0.6313909327
## instrumentalness -0.8210624307
## key1             -0.0140152818
## key2             -0.1557323229
## key3              0.4654206935
## key4             -0.1797053979
## key5              0.0291019833
## key6              0.0724378741
## key7              0.0623824309
## key8             -0.0095258891
## key9              0.0216445665
## key10             0.0301392242
## key11            -0.0311864381
## loudness          0.0852587410
## speechiness       1.4727602887
## tempo             0.0001278062
## valence          -0.7066431211

search for best cutoff with validation set

enet0.out <- reg.opt.cut.func(enet0, logit.valid0)
opt.cut.plot(enet0.out)

# cut off by kappa
enet0.out$cutoff[which.max(enet0.out$kappa.vec)]
## [1] 0.15
enet0.out$cutoff[which.min(enet0.out$ssdiff.vec)]
## [1] 0.14

create final model on train and validation

set.seed(123)
enet0 <- train(popular ~ ., data = all.train0, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
enet0$bestTune
##     alpha lambda
## 100     0    0.1
# best coefficient
enet0.model <- coef(enet0$finalModel, enet0$bestTune$lambda)
enet0.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)       0.321762194
## duration         -0.004289330
## acousticness     -0.409439651
## danceability     -0.062528153
## energy           -0.607499947
## instrumentalness -0.839357369
## key1              0.013386505
## key2             -0.091231783
## key3              0.400527208
## key4             -0.198298922
## key5              0.053748073
## key6              0.142443327
## key7              0.036589828
## key8              0.030302380
## key9             -0.041870877
## key10             0.034094167
## key11            -0.034462056
## loudness          0.084161023
## speechiness       1.615890923
## tempo            -0.000348579
## valence          -0.724448093

Elastic Net 10-fold Cross-Validation \[ log(\frac{\hat \pi}{1 - \hat \pi}) = 0.322 - 0.004{Duration} - 0.409{Acousticness} - 0.063{Danceability} - 0.607{Energy} - 0.839{Instrumentalness} \\ + 0.013{Key1} - 0.091{Key2} + 0.401{Key3} - 0.198{Key4} + 0.054{Key5} + 0.142{Key6} + 0.037{Key7} + 0.030{Key8} \\ - 0.042{Key9} + 0.034{Key10} - 0.034{Key11} + 0.084{Loudness} + 1.616{Speechiness} - 0.0003{Tempo} - 0.724{Valence} \]

predict and evaluate on the test set where cutoff is at 0.5

prob.enet0 <- predict(enet0, s = enet0$bestTune, logit.test0, type = "prob")
pred.enet0 <- ifelse(prob.enet0[,2] > 0.50, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.enet0), as.factor(logit.test0$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.enet0),
## as.factor(logit.test0$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 607  94
##          1   0   0
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8384, 0.8903)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8659          
##              Prevalence : 0.1341          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

predict and evaluate on the test set where cutoff is at 0.15 corresponding to optimal kappa

prob.enet0 <- predict(enet0, s = enet0$bestTune, logit.test0, type = "prob")
pred.enet0 <- ifelse(prob.enet0[,2] > 0.15, 1, 0) # using cutoff = 0.15
confusionMatrix(as.factor(pred.enet0), as.factor(logit.test0$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 464  63
##          1 143  31
##                                           
##                Accuracy : 0.7061          
##                  95% CI : (0.6709, 0.7396)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0693          
##                                           
##  Mcnemar's Test P-Value : 3.709e-08       
##                                           
##             Sensitivity : 0.32979         
##             Specificity : 0.76442         
##          Pos Pred Value : 0.17816         
##          Neg Pred Value : 0.88046         
##              Prevalence : 0.13409         
##          Detection Rate : 0.04422         
##    Detection Prevalence : 0.24822         
##       Balanced Accuracy : 0.54710         
##                                           
##        'Positive' Class : 1               
## 

predict and evaluate on the test set where cutoff is at 0.14 corresponding to optimal balance of sensitivity and specificity

prob.enet0 <- predict(enet0, s = enet0$bestTune, logit.test0, type = "prob")
pred.enet0 <- ifelse(prob.enet0[,2] > 0.14, 1, 0) # using cutoff = 0.14
confusionMatrix(as.factor(pred.enet0), as.factor(logit.test0$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 403  48
##          1 204  46
##                                           
##                Accuracy : 0.6405          
##                  95% CI : (0.6037, 0.6761)
##     No Information Rate : 0.8659          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0901          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.48936         
##             Specificity : 0.66392         
##          Pos Pred Value : 0.18400         
##          Neg Pred Value : 0.89357         
##              Prevalence : 0.13409         
##          Detection Rate : 0.06562         
##    Detection Prevalence : 0.35663         
##       Balanced Accuracy : 0.57664         
##                                           
##        'Positive' Class : 1               
## 

Decide on the best logistic model for Mode 0

Diagnostic measures

results.logit0 <- data.frame(model = c("Full", "Step 10CV", "All Subsets", "Ridge 10CV", "Elastic Net 10CV"),
                             cutoff = c(0.14, 0.14, 0.14, 0.14, 0.14),
                             AIC = c(2946, 2928, 2938.2, NA, NA),
                             Accuracy =c(0.6205, 0.6362, 0.6505, 0.6405, 0.6405),
                             Kappa = c(0.1013, 0.1007, 0.1269, 0.0901, 0.0901),
                             Sensitivity = c(0.55319, 0.5213, 0.55319, 0.48936, 0.48936),
                             Specificity = c(0.63097, 0.6540, 0.66557, 0.66392, 0.66392))
results.logit0
##              model cutoff    AIC Accuracy  Kappa Sensitivity Specificity
## 1             Full   0.14 2946.0   0.6205 0.1013     0.55319     0.63097
## 2        Step 10CV   0.14 2928.0   0.6362 0.1007     0.52130     0.65400
## 3      All Subsets   0.14 2938.2   0.6505 0.1269     0.55319     0.66557
## 4       Ridge 10CV   0.14     NA   0.6405 0.0901     0.48936     0.66392
## 5 Elastic Net 10CV   0.14     NA   0.6405 0.0901     0.48936     0.66392

The best model is the all-subsets regression model: it has the highest accuracy, kappa, and specificity, and ties the full model for the highest sensitivity.
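Cohen's kappa compares observed accuracy against the agreement expected from the marginal class frequencies alone. Recomputing it by hand from the all-subsets confusion matrix at cutoff 0.14, as a check:

```r
# Confusion matrix counts from the all-subsets model at cutoff 0.14
# (rows = predicted 0/1, columns = reference 0/1).
cm <- matrix(c(404, 203, 42, 52), nrow = 2)
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
kappa <- (po - pe) / (1 - pe)
round(kappa, 4)                              # matches the reported 0.1269
```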

All Possible Regression:

Compared to the full logistic model, the all-subsets regression model leaves out Danceability, all Key dummy variables, and Tempo.

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 5.369 - 0.009{Duration} - 1.537{Acousticness} - 3.955{Energy} - 2.722{Instrumentalness} \\ + 0.354{Loudness} + 3.278{Speechiness} - 1.445{Valence} \]

Mode 1 popularity

Full Logistic Model

fit the logistic model on the training data, then search for the optimal cutoff on the validation set

# Fit logistic model on training data
v.logit.model1 <- glm(popular ~ ., family=binomial(link='logit'),data= logit.train1)

#search for best cutoff
out1 <- opt.cut.func(v.logit.model1, logit.valid1)
opt.cut.plot(out1)

out1$cutoff[which.min(out1$ssdiff.vec)]
## [1] 0.11
v.opt.cut1 <- out1$cutoff[which.max(out1$kappa.vec)]
v.opt.cut1
## [1] 0.15

Fit final model (combo of train and validation)

v.model.final1 <-  glm(popular ~ ., data=all.train1, family=binomial(link='logit'))
summary(v.model.final1)
## 
## Call:
## glm(formula = popular ~ ., family = binomial(link = "logit"), 
##     data = all.train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6592  -0.5252  -0.4249  -0.3197   2.7327  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        3.197998   0.707127   4.523 6.11e-06 ***
## duration          -0.007794   0.001373  -5.677 1.37e-08 ***
## acousticness      -0.747392   0.251627  -2.970  0.00298 ** 
## danceability       0.934948   0.422723   2.212  0.02699 *  
## energy            -3.456932   0.480703  -7.191 6.41e-13 ***
## instrumentalness -12.325417   7.332297  -1.681  0.09277 .  
## key1               0.133834   0.157571   0.849  0.39568    
## key2               0.070416   0.167327   0.421  0.67388    
## key3               0.585095   0.231789   2.524  0.01159 *  
## key4              -0.158981   0.233397  -0.681  0.49577    
## key5               0.030558   0.197109   0.155  0.87680    
## key6               0.052258   0.193471   0.270  0.78708    
## key7               0.173937   0.152944   1.137  0.25543    
## key8               0.344365   0.168571   2.043  0.04107 *  
## key9               0.274620   0.182927   1.501  0.13329    
## key10             -0.385177   0.276928  -1.391  0.16426    
## key11             -0.070512   0.212423  -0.332  0.73993    
## loudness           0.329966   0.035393   9.323  < 2e-16 ***
## speechiness        4.733615   0.579107   8.174 2.98e-16 ***
## tempo              0.002791   0.001580   1.767  0.07728 .  
## valence           -1.385900   0.250695  -5.528 3.23e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4359.9  on 6237  degrees of freedom
## Residual deviance: 4089.1  on 6217  degrees of freedom
## AIC: 4131.1
## 
## Number of Fisher Scoring iterations: 9

Full Logistic Model \[ log(\frac{\hat \pi}{1 - \hat \pi}) = 3.198 - 0.008{Duration} - 0.747{Acousticness} + 0.935{Danceability} - 3.457{Energy} - 12.325{Instrumentalness} \\ + 0.134{Key1} + 0.070{Key2} + 0.585{Key3} - 0.159{Key4} + 0.031{Key5} + 0.052{Key6} + 0.174{Key7} + 0.344{Key8} + 0.275{Key9} \\ - 0.385{Key10} - 0.071{Key11} + 0.330{Loudness} + 4.734{Speechiness} + 0.003{Tempo} - 1.386{Valence} \]

Variables Duration, Acousticness, Danceability, Energy, Key3 (D#/E-flat Major), Key8 (G#/A-flat Major), Loudness, Speechiness, and Valence are significant variables in the model.

predict on test using 0.5 cutoff

v.prob.test1 <- predict(v.logit.model1, newdata = logit.test1, type = "response")
v.pred.test1 <- ifelse(v.prob.test1 > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(v.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 987 113
##          1   1   0
##                                           
##                Accuracy : 0.8965          
##                  95% CI : (0.8769, 0.9138)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.5643          
##                                           
##                   Kappa : -0.0018         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000000       
##             Specificity : 0.9989879       
##          Pos Pred Value : 0.0000000       
##          Neg Pred Value : 0.8972727       
##              Prevalence : 0.1026340       
##          Detection Rate : 0.0000000       
##    Detection Prevalence : 0.0009083       
##       Balanced Accuracy : 0.4994939       
##                                           
##        'Positive' Class : 1               
## 

predict on test using optimal cutoff (kappa), 0.15

v.prob.test1 <- predict(v.logit.model1, newdata = logit.test1, type = "response")
v.pred.test1 <- ifelse(v.prob.test1 > v.opt.cut1, 1, 0) # using cutoff = 0.15
confusionMatrix(as.factor(v.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 817  72
##          1 171  41
##                                           
##                Accuracy : 0.7793          
##                  95% CI : (0.7536, 0.8035)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1367          
##                                           
##  Mcnemar's Test P-Value : 3.243e-10       
##                                           
##             Sensitivity : 0.36283         
##             Specificity : 0.82692         
##          Pos Pred Value : 0.19340         
##          Neg Pred Value : 0.91901         
##              Prevalence : 0.10263         
##          Detection Rate : 0.03724         
##    Detection Prevalence : 0.19255         
##       Balanced Accuracy : 0.59488         
##                                           
##        'Positive' Class : 1               
## 

Predict on the test set using the cutoff of 0.11, which best balances sensitivity and specificity.

v.prob.test1 <- predict(v.logit.model1, newdata = logit.test1, type = "response")
v.pred.test1 <- ifelse(v.prob.test1 > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(v.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 609  43
##          1 379  70
##                                           
##                Accuracy : 0.6167          
##                  95% CI : (0.5873, 0.6455)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1018          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.61947         
##             Specificity : 0.61640         
##          Pos Pred Value : 0.15590         
##          Neg Pred Value : 0.93405         
##              Prevalence : 0.10263         
##          Detection Rate : 0.06358         
##    Detection Prevalence : 0.40781         
##       Balanced Accuracy : 0.61793         
##                                           
##        'Positive' Class : 1               
## 

Variable Selection: Stepwise (10-Fold CV)

Run stepwise selection with 10-fold cross-validation on the training set.

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 

# Fit the stepwise model with 10-fold CV; capture.output() suppresses the
# verbose selection trace printed by glmStepAIC
meh <- capture.output(step1_kcv <- train(popular ~ ., data = logit.train1, family = binomial(), 
                 method = "glmStepAIC", trControl = train_control, verbose = FALSE))
step1_kcv <- step1_kcv$finalModel
step1_kcv
## 
## Call:  NULL
## 
## Coefficients:
##      (Intercept)          duration      acousticness            energy  
##         4.073473         -0.007853         -0.844912         -3.381390  
## instrumentalness              key3              key4             key10  
##       -13.099381          0.632739         -0.501977         -0.372704  
##         loudness       speechiness           valence  
##         0.321980          5.084945         -1.236185  
## 
## Degrees of Freedom: 5136 Total (i.e. Null);  5126 Residual
## Null Deviance:       3568 
## Residual Deviance: 3357  AIC: 3379
summary(step1_kcv)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8184  -0.5225  -0.4259  -0.3238   2.7590  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.073473   0.632373   6.442 1.18e-10 ***
## duration          -0.007853   0.001505  -5.217 1.82e-07 ***
## acousticness      -0.844912   0.272201  -3.104  0.00191 ** 
## energy            -3.381390   0.532916  -6.345 2.22e-10 ***
## instrumentalness -13.099381   8.926829  -1.467  0.14226    
## key3               0.632739   0.221482   2.857  0.00428 ** 
## key4              -0.501977   0.254431  -1.973  0.04850 *  
## key10             -0.372704   0.268693  -1.387  0.16541    
## loudness           0.321980   0.038794   8.300  < 2e-16 ***
## speechiness        5.084945   0.633454   8.027 9.96e-16 ***
## valence           -1.236185   0.248294  -4.979 6.40e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3568.2  on 5136  degrees of freedom
## Residual deviance: 3357.5  on 5126  degrees of freedom
## AIC: 3379.5
## 
## Number of Fisher Scoring iterations: 9
#search for best cutoff
kcv.out1 <- opt.cut.func(step1_kcv, key.dummy(logit.valid1))
opt.cut.plot(kcv.out1)

kcv.out1$cutoff[which.min(kcv.out1$ssdiff.vec)]
## [1] 0.11
kcv.out1.cut1 <- kcv.out1$cutoff[which.max(kcv.out1$kappa.vec)]
kcv.out1.cut1
## [1] 0.13
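The helper `opt.cut.func()` is defined in `thesis_functions.R` and not shown here. As context for how the 0.11 and 0.13 cutoffs above arise, a minimal sketch of such a cutoff search (the function name, grid, and return fields are assumptions inferred from how the helper is used) sweeps candidate cutoffs and records Cohen's kappa and the sensitivity-specificity gap at each one:

```r
# Hedged sketch of a cutoff search like opt.cut.func(): for each candidate
# cutoff, classify, then record kappa and |sensitivity - specificity|.
sketch.opt.cut <- function(probs, actual, cutoffs = seq(0.01, 0.50, by = 0.01)) {
  kappa.vec  <- numeric(length(cutoffs))
  ssdiff.vec <- numeric(length(cutoffs))
  for (i in seq_along(cutoffs)) {
    pred <- factor(ifelse(probs > cutoffs[i], 1, 0), levels = c(0, 1))
    cm   <- caret::confusionMatrix(pred, factor(actual, levels = c(0, 1)),
                                   positive = "1")
    kappa.vec[i]  <- cm$overall["Kappa"]
    ssdiff.vec[i] <- abs(cm$byClass["Sensitivity"] - cm$byClass["Specificity"])
  }
  list(cutoff = cutoffs, kappa.vec = kappa.vec, ssdiff.vec = ssdiff.vec)
}
```

The kappa-optimal cutoff is then `cutoff[which.max(kappa.vec)]` and the balanced cutoff is `cutoff[which.min(ssdiff.vec)]`, matching the indexing used above.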

Fit the final model on the combined training and validation sets.

set.seed(123)
finalmeh <- capture.output(step1_kcv.final <- train(popular ~ ., data = all.train1, family = binomial(), 
                 method = "glmStepAIC", trControl = train_control, verbose = FALSE))
step1_kcv.final <- step1_kcv.final$finalModel
step1_kcv.final
## 
## Call:  NULL
## 
## Coefficients:
##      (Intercept)          duration      acousticness      danceability  
##         3.279565         -0.007749         -0.777317          0.976056  
##           energy  instrumentalness              key3              key8  
##        -3.477833        -12.106204          0.512832          0.267298  
##            key10          loudness       speechiness             tempo  
##        -0.459592          0.333033          4.790242          0.002819  
##          valence  
##        -1.411113  
## 
## Degrees of Freedom: 6237 Total (i.e. Null);  6225 Residual
## Null Deviance:       4360 
## Residual Deviance: 4095  AIC: 4121
summary(step1_kcv.final)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6496  -0.5252  -0.4275  -0.3218   2.7325  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        3.279565   0.694168   4.724 2.31e-06 ***
## duration          -0.007749   0.001370  -5.657 1.54e-08 ***
## acousticness      -0.777317   0.250366  -3.105   0.0019 ** 
## danceability       0.976056   0.421673   2.315   0.0206 *  
## energy            -3.477833   0.480062  -7.245 4.34e-13 ***
## instrumentalness -12.106204   7.288516  -1.661   0.0967 .  
## key3               0.512832   0.209177   2.452   0.0142 *  
## key8               0.267298   0.136787   1.954   0.0507 .  
## key10             -0.459592   0.258542  -1.778   0.0755 .  
## loudness           0.333033   0.035373   9.415  < 2e-16 ***
## speechiness        4.790242   0.577726   8.292  < 2e-16 ***
## tempo              0.002819   0.001578   1.786   0.0741 .  
## valence           -1.411113   0.249814  -5.649 1.62e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4359.9  on 6237  degrees of freedom
## Residual deviance: 4094.5  on 6225  degrees of freedom
## AIC: 4120.5
## 
## Number of Fisher Scoring iterations: 9

Stepwise 10-Fold Cross-Validation:

Compared to the full logistic model, the stepwise model drops Key1, Key2, Key4 through Key7, Key9, and Key11.

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 3.280 - 0.008{Duration} - 0.777{Acousticness} + 0.976{Danceability} - 3.478{Energy} - 12.106{Instrumentalness} \\ + 0.513{Key3} + 0.267{Key8} - 0.460{Key10} + 0.333{Loudness} + 4.790{Speechiness} + 0.003{Tempo} - 1.411{Valence} \]

Duration, Acousticness, Danceability, Energy, Key3 (D♯/E♭), Loudness, Speechiness, and Valence are significant in the model at the 0.05 level.
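To illustrate how the fitted stepwise equation maps a song's features to a probability, here is a hypothetical worked example. The feature values below are made up for illustration and are not drawn from the data; `plogis()` inverts the logit.

```r
# Hypothetical feature values plugged into the fitted stepwise equation:
# duration 180, acousticness 0.20, danceability 0.70, energy 0.80,
# instrumentalness 0, key8 = 1 (all other key dummies 0), loudness -4 dB,
# speechiness 0.10, tempo 120, valence 0.50.
eta <- 3.280 - 0.008*180 - 0.777*0.20 + 0.976*0.70 - 3.478*0.80 -
       12.106*0 + 0.513*0 + 0.267*1 - 0.460*0 + 0.333*(-4) +
       4.790*0.10 + 0.003*120 - 1.411*0.50
plogis(eta)  # predicted probability that the song is "popular"
```

A predicted probability above the chosen cutoff would classify the song as popular, which is why the cutoff choice in the sections below matters so much for the confusion matrices.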

Predict on the test set using the standard 0.5 cutoff.

kcv.prob.test1 <- predict(step1_kcv.final, newdata = key.dummy(logit.test1), type = "response")
kcv.pred.test1 <- ifelse(kcv.prob.test1 > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(kcv.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 988 112
##          1   0   1
##                                           
##                Accuracy : 0.8983          
##                  95% CI : (0.8789, 0.9155)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.4854          
##                                           
##                   Kappa : 0.0158          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0088496       
##             Specificity : 1.0000000       
##          Pos Pred Value : 1.0000000       
##          Neg Pred Value : 0.8981818       
##              Prevalence : 0.1026340       
##          Detection Rate : 0.0009083       
##    Detection Prevalence : 0.0009083       
##       Balanced Accuracy : 0.5044248       
##                                           
##        'Positive' Class : 1               
## 

Predict on the test set using the cutoff of 0.13, which maximizes kappa.

kcv.prob.test1 <- predict(step1_kcv.final, newdata = key.dummy(logit.test1), type = "response")
kcv.pred.test1 <- ifelse(kcv.prob.test1 > kcv.out1.cut1, 1, 0) # using cutoff = 0.13
confusionMatrix(as.factor(kcv.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 723  56
##          1 265  57
##                                           
##                Accuracy : 0.7084          
##                  95% CI : (0.6806, 0.7352)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1299          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.50442         
##             Specificity : 0.73178         
##          Pos Pred Value : 0.17702         
##          Neg Pred Value : 0.92811         
##              Prevalence : 0.10263         
##          Detection Rate : 0.05177         
##    Detection Prevalence : 0.29246         
##       Balanced Accuracy : 0.61810         
##                                           
##        'Positive' Class : 1               
## 

Predict on the test set using the cutoff of 0.11, which best balances sensitivity and specificity.

kcv.prob.test1 <- predict(step1_kcv.final, newdata = key.dummy(logit.test1), type = "response")
kcv.pred.test1 <- ifelse(kcv.prob.test1 > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(kcv.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 594  45
##          1 394  68
##                                           
##                Accuracy : 0.6013          
##                  95% CI : (0.5717, 0.6303)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0857          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60177         
##             Specificity : 0.60121         
##          Pos Pred Value : 0.14719         
##          Neg Pred Value : 0.92958         
##              Prevalence : 0.10263         
##          Detection Rate : 0.06176         
##    Detection Prevalence : 0.41962         
##       Balanced Accuracy : 0.60149         
##                                           
##        'Positive' Class : 1               
## 

Variable Selection: All Possible Regressions

set.seed(123)
glmulti.out1 <- glmulti(popular ~ ., data = logit.train1,
                           level = 1,               # No interaction considered
            method = "h",            # Exhaustive approach
            crit = "aic",            # AIC as criteria
            confsetsize = 5,         # Keep 5 best models
            plotty = F, report = F,  # No plot or interim reports
            fitfunction = "glm",     # glm fitting function
            family = binomial)       # binomial family for logistic regression 
#glmulti.out1@formulas

View the summary of the top model.

summary(glmulti.out1@objects[[1]])
## 
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8240  -0.5209  -0.4300  -0.3327   2.7606  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.019309   0.628360   6.397 1.59e-10 ***
## duration          -0.007683   0.001494  -5.143 2.71e-07 ***
## acousticness      -0.831974   0.270427  -3.077  0.00209 ** 
## energy            -3.389625   0.529955  -6.396 1.59e-10 ***
## instrumentalness -13.175893   9.028253  -1.459  0.14445    
## loudness           0.320576   0.038525   8.321  < 2e-16 ***
## speechiness        5.094770   0.631621   8.066 7.25e-16 ***
## valence           -1.227177   0.247290  -4.963 6.96e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3568.2  on 5136  degrees of freedom
## Residual deviance: 3372.1  on 5129  degrees of freedom
## AIC: 3388.1
## 
## Number of Fisher Scoring iterations: 9

Store model

allreg.logit1 <- glmulti.out1@objects[[1]]
#search for best cutoff
allreg.out1 <- opt.cut.func(allreg.logit1, logit.valid1)
opt.cut.plot(allreg.out1)

allreg.out1$cutoff[which.min(allreg.out1$ssdiff.vec)]
## [1] 0.11
allreg.out1.cut1 <- allreg.out1$cutoff[which.max(allreg.out1$kappa.vec)]
allreg.out1.cut1
## [1] 0.16

Fit the final model on the combined training and validation sets.

set.seed(123)
glmulti.out1 <- glmulti(popular ~ ., data = all.train1,
                           level = 1,               # No interaction considered
            method = "h",            # Exhaustive approach
            crit = "aic",            # AIC as criteria
            confsetsize = 5,         # Keep 5 best models
            plotty = F, report = F,  # No plot or interim reports
            fitfunction = "glm",     # glm fitting function
            family = binomial)       # binomial family for logistic regression 
#glmulti.out1@formulas
summary(glmulti.out1@objects[[1]])
## 
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6717  -0.5243  -0.4296  -0.3299   2.7251  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        3.289296   0.692110   4.753 2.01e-06 ***
## duration          -0.007681   0.001366  -5.624 1.86e-08 ***
## acousticness      -0.750350   0.249223  -3.011  0.00261 ** 
## danceability       0.942373   0.420840   2.239  0.02514 *  
## energy            -3.483207   0.478705  -7.276 3.43e-13 ***
## instrumentalness -11.929151   7.266196  -1.642  0.10065    
## loudness           0.333199   0.035260   9.450  < 2e-16 ***
## speechiness        4.822191   0.576504   8.365  < 2e-16 ***
## tempo              0.002978   0.001572   1.894  0.05822 .  
## valence           -1.405615   0.249178  -5.641 1.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4359.9  on 6237  degrees of freedom
## Residual deviance: 4107.6  on 6228  degrees of freedom
## AIC: 4127.6
## 
## Number of Fisher Scoring iterations: 9

All Possible Regressions: compared to the full logistic model, the all-subsets model does not include any of the key dummy variables.

\[ log(\frac{\hat \pi}{1 - \hat \pi}) = 3.289 - 0.008{Duration} - 0.750{Acousticness} + 0.942{Danceability} - 3.483{Energy} - 11.929{Instrumentalness} \\ + 0.333{Loudness} + 4.822{Speechiness} + 0.003{Tempo} - 1.406{Valence} \]

Duration, Acousticness, Danceability, Energy, Loudness, Speechiness, and Valence are significant in the model.

Store the final model.

allreg.logit1.final <- glmulti.out1@objects[[1]]

Predictions using the standard 0.5 cutoff.

allreg.prob.test1 <- predict(allreg.logit1.final, newdata = logit.test1, type = "response")
allreg.pred.test1 <- ifelse(allreg.prob.test1 > 0.5, 1, 0) # using standard cutoff 0.5
confusionMatrix(as.factor(allreg.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 988 112
##          1   0   1
##                                           
##                Accuracy : 0.8983          
##                  95% CI : (0.8789, 0.9155)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.4854          
##                                           
##                   Kappa : 0.0158          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0088496       
##             Specificity : 1.0000000       
##          Pos Pred Value : 1.0000000       
##          Neg Pred Value : 0.8981818       
##              Prevalence : 0.1026340       
##          Detection Rate : 0.0009083       
##    Detection Prevalence : 0.0009083       
##       Balanced Accuracy : 0.5044248       
##                                           
##        'Positive' Class : 1               
## 

Predictions using the cutoff of 0.16, which maximizes kappa.

allreg.prob.test1 <- predict(allreg.logit1.final, newdata = logit.test1, type = "response")
allreg.pred.test1 <- ifelse(allreg.prob.test1 > allreg.out1.cut1, 1, 0) # using optimal cutoff 0.16
confusionMatrix(as.factor(allreg.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 841  77
##          1 147  36
##                                         
##                Accuracy : 0.7965        
##                  95% CI : (0.7715, 0.82)
##     No Information Rate : 0.8974        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.1332        
##                                         
##  Mcnemar's Test P-Value : 4.022e-06     
##                                         
##             Sensitivity : 0.3186        
##             Specificity : 0.8512        
##          Pos Pred Value : 0.1967        
##          Neg Pred Value : 0.9161        
##              Prevalence : 0.1026        
##          Detection Rate : 0.0327        
##    Detection Prevalence : 0.1662        
##       Balanced Accuracy : 0.5849        
##                                         
##        'Positive' Class : 1             
## 

Predictions using the cutoff of 0.11, which best balances sensitivity and specificity.

allreg.prob.test1 <- predict(allreg.logit1.final, newdata = logit.test1, type = "response")
allreg.pred.test1 <- ifelse(allreg.prob.test1 > 0.11, 1, 0) # using optimal cutoff  0.11
confusionMatrix(as.factor(allreg.pred.test1), as.factor(logit.test1$popular), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 594  44
##          1 394  69
##                                           
##                Accuracy : 0.6022          
##                  95% CI : (0.5726, 0.6312)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0893          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.61062         
##             Specificity : 0.60121         
##          Pos Pred Value : 0.14903         
##          Neg Pred Value : 0.93103         
##              Prevalence : 0.10263         
##          Detection Rate : 0.06267         
##    Detection Prevalence : 0.42053         
##       Balanced Accuracy : 0.60592         
##                                           
##        'Positive' Class : 1               
## 

Regularized Regression: Ridge 10-Fold Cross-Validation

set.seed(123)
ridge1 <- train(popular ~ ., data = logit.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#best tuning parameter
ridge1$bestTune
##     alpha lambda
## 100     0    0.1
ridge1.model <- coef(ridge1$finalModel, ridge1$bestTune$lambda)
ridge1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      -0.940884985
## duration         -0.003302624
## acousticness     -0.107972712
## danceability     -0.008980677
## energy           -0.260064046
## instrumentalness -0.827713363
## key1              0.046091952
## key2              0.008592692
## key3              0.349398913
## key4             -0.189817749
## key5              0.003945898
## key6             -0.027278685
## key7              0.091414861
## key8              0.090088954
## key9              0.089685627
## key10            -0.163351622
## key11            -0.024566167
## loudness          0.062462753
## speechiness       2.309752435
## tempo             0.001341193
## valence          -0.487758435

Search for the best cutoff using the validation set. (Note that the selected lambda of 0.1 sits at the upper edge of the tuning grid, so a wider grid may be worth exploring.)

ridge1.out <- reg.opt.cut.func(ridge1, logit.valid1)
opt.cut.plot(ridge1.out)

# cut off by kappa
ridge1.out$cutoff[which.max(ridge1.out$kappa.vec)]
## [1] 0.12
ridge1.out$cutoff[which.min(ridge1.out$ssdiff.vec)]
## [1] 0.11

Fit the final model on the combined training and validation sets.

set.seed(123)
ridge1 <- train(popular ~ ., data = all.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#best tuning parameter
ridge1$bestTune
##     alpha lambda
## 100     0    0.1
ridge1.model <- coef(ridge1$finalModel, ridge1$bestTune$lambda)
ridge1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      -1.031119627
## duration         -0.003368704
## acousticness     -0.103141402
## danceability      0.159103462
## energy           -0.270068249
## instrumentalness -0.817245806
## key1              0.026882836
## key2              0.007008190
## key3              0.266050136
## key4             -0.114394579
## key5             -0.025139220
## key6             -0.032287416
## key7              0.078237033
## key8              0.138366721
## key9              0.130311371
## key10            -0.214434940
## key11            -0.063822095
## loudness          0.064513136
## speechiness       2.177666732
## tempo             0.001583618
## valence          -0.474359450

Ridge 10-Fold Cross-Validation \[ log(\frac{\hat \pi}{1 - \hat \pi}) = -1.031 - 0.003{Duration} - 0.103{Acousticness} + 0.159{Danceability} - 0.270{Energy} - 0.817{Instrumentalness} \\ + 0.027{Key1} + 0.007{Key2} + 0.266{Key3} - 0.114{Key4} - 0.025{Key5} - 0.032{Key6} + 0.078{Key7} + 0.138{Key8} + 0.130{Key9} \\ - 0.214{Key10} - 0.064{Key11} + 0.065{Loudness} + 2.178{Speechiness} + 0.002{Tempo} - 0.474{Valence} \]

Predict and evaluate on the test set using the standard 0.5 cutoff.

prob.ridge1 <- predict(ridge1, s = ridge1$bestTune, logit.test1, type = "prob")
pred.ridge1 <- ifelse(prob.ridge1[,2] > 0.50, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.ridge1), as.factor(logit.test1$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.ridge1),
## as.factor(logit.test1$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 988 113
##          1   0   0
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8779, 0.9147)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.525           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8974          
##              Prevalence : 0.1026          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

Predict and evaluate on the test set using the cutoff of 0.12, which maximizes kappa.

prob.ridge1 <- predict(ridge1, s = ridge1$bestTune, logit.test1, type = "prob")
pred.ridge1 <- ifelse(prob.ridge1[,2] > 0.12, 1, 0) # using cutoff = 0.12
confusionMatrix(as.factor(pred.ridge1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 703  60
##          1 285  53
##                                          
##                Accuracy : 0.6866         
##                  95% CI : (0.6583, 0.714)
##     No Information Rate : 0.8974         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.096          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.46903        
##             Specificity : 0.71154        
##          Pos Pred Value : 0.15680        
##          Neg Pred Value : 0.92136        
##              Prevalence : 0.10263        
##          Detection Rate : 0.04814        
##    Detection Prevalence : 0.30699        
##       Balanced Accuracy : 0.59028        
##                                          
##        'Positive' Class : 1              
## 

Predict and evaluate on the test set using the cutoff of 0.11, which best balances sensitivity and specificity.

prob.ridge1 <- predict(ridge1, s = ridge1$bestTune, logit.test1, type = "prob")
pred.ridge1 <- ifelse(prob.ridge1[,2] > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(pred.ridge1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 557  36
##          1 431  77
##                                          
##                Accuracy : 0.5758         
##                  95% CI : (0.546, 0.6053)
##     No Information Rate : 0.8974         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0962         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.68142        
##             Specificity : 0.56377        
##          Pos Pred Value : 0.15157        
##          Neg Pred Value : 0.93929        
##              Prevalence : 0.10263        
##          Detection Rate : 0.06994        
##    Detection Prevalence : 0.46140        
##       Balanced Accuracy : 0.62259        
##                                          
##        'Positive' Class : 1              
## 

Regularized Regression: Lasso 10-Fold Cross-Validation (at the selected penalty, every coefficient is shrunk to zero, leaving an intercept-only model)

set.seed(123)
lasso1 <- train(popular ~ ., data = logit.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lasso1$bestTune
##     alpha lambda
## 100     1    0.1
# best coefficient
lasso1.model <- coef(lasso1$finalModel, lasso1$bestTune$lambda)
lasso1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)      -2.086909
## duration          .       
## acousticness      .       
## danceability      .       
## energy            .       
## instrumentalness  .       
## key1              .       
## key2              .       
## key3              .       
## key4              .       
## key5              .       
## key6              .       
## key7              .       
## key8              .       
## key9              .       
## key10             .       
## key11             .       
## loudness          .       
## speechiness       .       
## tempo             .       
## valence           .
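The intercept-only result is a direct consequence of the L1 penalty: `bestTune` selects lambda = 0.1, the largest value in the tuning grid, and at that penalty the lasso zeroes every coefficient. As a hedged diagnostic (not part of the original analysis; `logit.train1` is assumed to hold the training data as above), one could let `cv.glmnet` choose its own lambda sequence to check whether smaller penalties retain predictors:

```r
# Diagnostic sketch: fit the lasso over glmnet's default lambda path and
# inspect coefficients at the cross-validated optimum.
library(glmnet)
x <- model.matrix(popular ~ ., data = logit.train1)[, -1]  # drop intercept column
y <- logit.train1$popular
set.seed(123)
cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv.fit, s = "lambda.min")  # coefficients at the CV-optimal penalty
```

If the CV-optimal lambda lies well below 0.001 (the bottom of the caret grid), the intercept-only model reflects the grid choice rather than a genuine absence of signal.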

Search for the best cutoff using the validation set.

lasso1.out <- reg.opt.cut.func(lasso1, logit.valid1)
opt.cut.plot(lasso1.out)

# cut off by kappa
lasso1.out$cutoff[which.max(lasso1.out$kappa.vec)]
## [1] 0
lasso1.out$cutoff[which.min(lasso1.out$ssdiff.vec)]
## [1] 0

Fit the final model on the combined training and validation sets.

set.seed(123)
lasso1.final <- train(popular ~ ., data = all.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
lasso1.final$bestTune
##     alpha lambda
## 100     1    0.1
# best coefficient
lasso1.model.final <- coef(lasso1.final$finalModel, lasso1.final$bestTune$lambda)
lasso1.model.final
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)      -2.076379
## duration          .       
## acousticness      .       
## danceability      .       
## energy            .       
## instrumentalness  .       
## key1              .       
## key2              .       
## key3              .       
## key4              .       
## key5              .       
## key6              .       
## key7              .       
## key8              .       
## key9              .       
## key10             .       
## key11             .       
## loudness          .       
## speechiness       .       
## tempo             .       
## valence           .

Predict and evaluate on the test set using the standard 0.5 cutoff.

prob.lasso1 <- predict(lasso1.final, s = lasso1.final$bestTune, logit.test1, type = "prob")
pred.lasso1 <- ifelse(prob.lasso1[,2] > 0.5, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.lasso1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 988 113
##          1   0   0
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8779, 0.9147)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.525           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8974          
##              Prevalence : 0.1026          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

Predict and evaluate on the test set using a cutoff of 0.11. The intercept-only model assigns every song the same predicted probability, plogis(-2.076) ≈ 0.111, so any cutoff below that value classifies every observation as popular.

prob.lasso1 <- predict(lasso1.final, s = lasso1.final$bestTune, logit.test1, type = "prob")
pred.lasso1 <- ifelse(prob.lasso1[,2] > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(pred.lasso1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   0   0
##          1 988 113
##                                           
##                Accuracy : 0.1026          
##                  95% CI : (0.0853, 0.1221)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.1026          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.1026          
##          Detection Rate : 0.1026          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 
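Both extreme cutoffs collapse to a single predicted class, so Kappa is 0 in each case. As a sanity check, Kappa can be recomputed by hand from the cutoff-0.5 confusion matrix above (base R only; the cell counts are copied from the output):

```r
# Recompute accuracy and Cohen's Kappa from the cutoff-0.5 table above,
# where every observation is predicted as class "0".
cm <- matrix(c(988, 0, 113, 0), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(c(accuracy = po, kappa = kappa), 4)    # accuracy 0.8974, Kappa 0
```

When a classifier never predicts the minority class, the observed accuracy equals the chance-agreement rate (both reduce to the majority-class prevalence), so Kappa is exactly 0 despite the high accuracy.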

Regularized Regression: Elastic Net with 10-fold Cross-Validation

set.seed(123)
enet1 <- train(popular ~ ., data = logit.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
enet1$bestTune
##     alpha lambda
## 100     0    0.1
# best coefficient
enet1.model <- coef(enet1$finalModel, enet1$bestTune$lambda)
enet1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      -0.940884985
## duration         -0.003302624
## acousticness     -0.107972712
## danceability     -0.008980677
## energy           -0.260064046
## instrumentalness -0.827713363
## key1              0.046091952
## key2              0.008592692
## key3              0.349398913
## key4             -0.189817749
## key5              0.003945898
## key6             -0.027278685
## key7              0.091414861
## key8              0.090088954
## key9              0.089685627
## key10            -0.163351622
## key11            -0.024566167
## loudness          0.062462753
## speechiness       2.309752435
## tempo             0.001341193
## valence          -0.487758435

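Note that the selected `alpha = 0` (and `lambda = 0.1`) sits at the boundary of the tuning grid, so this "elastic net" fit reduces to pure ridge regression; this is also why the Ridge and Elastic Net rows in the model-comparison table below are identical. For reference, glmnet's elastic net objective for `family = "binomial"` is

\[ \min_{\beta_0, \beta} \; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda \left[ (1-\alpha)\frac{1}{2}\lVert\beta\rVert_2^2 + \alpha \lVert\beta\rVert_1 \right] \]

where \(\ell\) is the binomial log-likelihood, \(\alpha = 0\) gives ridge, and \(\alpha = 1\) gives the lasso.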
Search for the best cutoff using the validation set.

enet1.out <- reg.opt.cut.func(enet1, logit.valid1)
opt.cut.plot(enet1.out)

# cut off by kappa
enet1.out$cutoff[which.max(enet1.out$kappa.vec)]
## [1] 0.12
enet1.out$cutoff[which.min(enet1.out$ssdiff.vec)]
## [1] 0.11
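`reg.opt.cut.func()` is defined in `thesis_functions.R` and not shown here. A minimal sketch of what such a cutoff sweep might look like, on synthetic data since the thesis data is not reproduced in this section, is (function and variable names are illustrative, not the actual implementation):

```r
# Hypothetical cutoff sweep: for each candidate cutoff, record Kappa and
# the absolute sensitivity-specificity gap, then pick the optimum of each.
sweep_cutoffs <- function(prob, truth, cutoffs = seq(0.01, 0.99, by = 0.01)) {
  stats <- t(sapply(cutoffs, function(cut) {
    pred <- as.integer(prob > cut)
    tp <- sum(pred == 1 & truth == 1); fn <- sum(pred == 0 & truth == 1)
    tn <- sum(pred == 0 & truth == 0); fp <- sum(pred == 1 & truth == 0)
    sens <- tp / (tp + fn); spec <- tn / (tn + fp)
    n  <- tp + fn + tn + fp
    po <- (tp + tn) / n
    pe <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
    c(kappa = (po - pe) / (1 - pe), ssdiff = abs(sens - spec))
  }))
  data.frame(cutoff = cutoffs, stats)
}

# Toy example: imbalanced labels with mildly informative probabilities.
set.seed(123)
truth <- rbinom(1000, 1, 0.10)
prob  <- plogis(-2.2 + 1.5 * truth + rnorm(1000))
out   <- sweep_cutoffs(prob, truth)
out$cutoff[which.max(out$kappa)]   # cutoff maximizing Kappa
out$cutoff[which.min(out$ssdiff)]  # cutoff balancing sensitivity/specificity
```

With a ~10% positive rate, both criteria typically select a cutoff well below 0.5, consistent with the 0.11-0.12 values found above.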

Create the final model on the combined training and validation sets.

set.seed(123)
enet1 <- train(popular ~ ., data = all.train1, method = "glmnet",
                      family = "binomial", trControl = train_control, 
                      tuneGrid = expand.grid(alpha = seq(0,1,by = 0.05),lambda = seq(0.001,0.1,by = 0.001)))
# best parameter
enet1$bestTune
##     alpha lambda
## 100     0    0.1
# best coefficient
enet1.model <- coef(enet1$finalModel, enet1$bestTune$lambda)
enet1.model
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      -1.031119627
## duration         -0.003368704
## acousticness     -0.103141402
## danceability      0.159103462
## energy           -0.270068249
## instrumentalness -0.817245806
## key1              0.026882836
## key2              0.007008190
## key3              0.266050136
## key4             -0.114394579
## key5             -0.025139220
## key6             -0.032287416
## key7              0.078237033
## key8              0.138366721
## key9              0.130311371
## key10            -0.214434940
## key11            -0.063822095
## loudness          0.064513136
## speechiness       2.177666732
## tempo             0.001583618
## valence          -0.474359450

Elastic Net 10-fold Cross-Validation Model \[ log(\frac{\hat \pi}{1 - \hat \pi}) = -1.031 - 0.003{Duration} - 0.103{Acousticness} + 0.159{Danceability} - 0.270{Energy} - 0.817{Instrumentalness} \\ + 0.027{Key1} + 0.007{Key2} + 0.266{Key3} - 0.114{Key4} - 0.025{Key5} - 0.032{Key6} + 0.078{Key7} + 0.138{Key8} \\ + 0.130{Key9} - 0.214{Key10} - 0.064{Key11} + 0.065{Loudness} + 2.178{Speechiness} + 0.002{Tempo} - 0.474{Valence} \]
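Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios. As an illustration, using three values copied from the coefficient output above:

```r
# Odds ratios for selected elastic-net coefficients (values from the output above)
coefs <- c(speechiness      =  2.177666732,
           instrumentalness = -0.817245806,
           valence          = -0.474359450)
round(exp(coefs), 3)
# speechiness 8.826, instrumentalness 0.442, valence 0.622
```

Holding the other predictors fixed, a one-unit increase in speechiness multiplies the odds of being popular by roughly 8.8, while higher instrumentalness and valence are associated with lower odds.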

Predict and evaluate on the test set with the cutoff at 0.5.

prob.enet1 <- predict(enet1, s = enet1$bestTune, logit.test1, type = "prob")
pred.enet1 <- ifelse(prob.enet1[,2] > 0.50, 1, 0) # using cutoff = 0.5
confusionMatrix(as.factor(pred.enet1), as.factor(logit.test1$popular), 
                positive = "1")
## Warning in confusionMatrix.default(as.factor(pred.enet1),
## as.factor(logit.test1$popular), : Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 988 113
##          1   0   0
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8779, 0.9147)
##     No Information Rate : 0.8974          
##     P-Value [Acc > NIR] : 0.525           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8974          
##              Prevalence : 0.1026          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

Predict and evaluate on the test set with the cutoff at 0.12, corresponding to the optimal Kappa.

prob.enet1 <- predict(enet1, s = enet1$bestTune, logit.test1, type = "prob")
pred.enet1 <- ifelse(prob.enet1[,2] > 0.12, 1, 0) # using cutoff = 0.12
confusionMatrix(as.factor(pred.enet1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 703  60
##          1 285  53
##                                          
##                Accuracy : 0.6866         
##                  95% CI : (0.6583, 0.714)
##     No Information Rate : 0.8974         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.096          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.46903        
##             Specificity : 0.71154        
##          Pos Pred Value : 0.15680        
##          Neg Pred Value : 0.92136        
##              Prevalence : 0.10263        
##          Detection Rate : 0.04814        
##    Detection Prevalence : 0.30699        
##       Balanced Accuracy : 0.59028        
##                                          
##        'Positive' Class : 1              
## 

Predict and evaluate on the test set with the cutoff at 0.11, corresponding to the optimal balance of sensitivity and specificity.

prob.enet1 <- predict(enet1, s = enet1$bestTune, logit.test1, type = "prob")
pred.enet1 <- ifelse(prob.enet1[,2] > 0.11, 1, 0) # using cutoff = 0.11
confusionMatrix(as.factor(pred.enet1), as.factor(logit.test1$popular), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 557  36
##          1 431  77
##                                          
##                Accuracy : 0.5758         
##                  95% CI : (0.546, 0.6053)
##     No Information Rate : 0.8974         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0962         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.68142        
##             Specificity : 0.56377        
##          Pos Pred Value : 0.15157        
##          Neg Pred Value : 0.93929        
##              Prevalence : 0.10263        
##          Detection Rate : 0.06994        
##    Detection Prevalence : 0.46140        
##       Balanced Accuracy : 0.62259        
##                                          
##        'Positive' Class : 1              
## 

Decide on the best logistic model for Model 1

Diagnostic measures

results.logit1 <- data.frame(model = c("Full", "Step 10CV", "All Subsets", "Ridge 10CV", "Elastic Net 10CV"),
                             cutoff = c(0.11, 0.11, 0.11, 0.11, 0.11),
                             AIC = c(4131.1, 4121, 4127.6, NA, NA),
                             Accuracy =c(0.6167, 0.6013, 0.6022, 0.5758, 0.5758),
                             Kappa = c(0.1018, 0.0857, 0.0893, 0.0962, 0.0962),
                             Sensitivity = c(0.61947, 0.60177, 0.61062, 0.68142, 0.68142),
                             Specificity = c(0.61640, 0.60121, 0.60121, 0.56377, 0.56377))
results.logit1
##              model cutoff    AIC Accuracy  Kappa Sensitivity Specificity
## 1             Full   0.11 4131.1   0.6167 0.1018     0.61947     0.61640
## 2        Step 10CV   0.11 4121.0   0.6013 0.0857     0.60177     0.60121
## 3      All Subsets   0.11 4127.6   0.6022 0.0893     0.61062     0.60121
## 4       Ridge 10CV   0.11     NA   0.5758 0.0962     0.68142     0.56377
## 5 Elastic Net 10CV   0.11     NA   0.5758 0.0962     0.68142     0.56377

The full logistic model yields the best accuracy and the most balanced trade-off between sensitivity and specificity.

Full Logistic Model \[ log(\frac{\hat \pi}{1 - \hat \pi}) = 3.198 - 0.008{Duration} - 0.747{Acousticness} + 0.935{Danceability} - 3.457{Energy} - 12.325{Instrumentalness} \\ + 0.134{Key1} + 0.070{Key2} + 0.586{Key3} - 0.159{Key4} + 0.031{Key5} + 0.052{Key6} + 0.174{Key7} + 0.344{Key8} + 0.275{Key9} \\ - 0.385{Key10} - 0.071{Key11} + 0.330{Loudness} + 4.734{Speechiness} + 0.003{Tempo} - 1.386{Valence} \]

Duration, Acousticness, Danceability, Energy, Key3 (D#/E-flat Major), Key8 (G#/A-flat Major), Loudness, Speechiness, and Valence are significant predictors in the model.